Abstract

Recent increase in online privacy concerns prompts the following question: can a recommendation 
engine be accurate if end-users do not entrust it with their private data? To provide an answer, we 
study the problem of learning user ratings for items under local or 'user-end' differential privacy, a 
powerful, formal notion of data privacy. 

We develop a systematic approach for lower bounds on the complexity of learning item struc- 
ture from privatized user inputs, based on mutual information. Our results identify a sample com- 
plexity separation between learning in the scarce information regime and the rich information 
regime, thereby highlighting the role of the amount of ratings (information) available to each user. 

In the information-rich regime (where each user rates at least a constant fraction of items), 
a spectral clustering approach is shown to achieve optimal sample complexity. However, the 
information-scarce regime (where each user rates only a vanishing fraction of the total item set) 
is found to require a fundamentally different approach. We propose a new algorithm, MaxSense, 
and show that it achieves optimal sample complexity in this setting. 

The techniques we develop for bounding mutual information may be of broader interest. To 
illustrate this, we show their applicability to (i) learning based on 1-bit sketches (in contrast to 
differentially private sketches), and (ii) adaptive learning, where queries can be adapted based on 
answers to past queries. 

Keywords: Differential privacy, recommender systems, lower bounds, partial information

1. Introduction 

Recommender systems are fast becoming one of the cornerstones of the Internet; in a world with 
ever increasing choices, they are one of the most effective ways of matching users with items. To- 
day, many websites (Amazon, Netflix, Yahoo, etc.) use some form of such systems, and research 
into these algorithms received a fillip from the recently concluded Netflix prize competition. Iron- 
ically, the contest also exposed the Achilles heel of such systems, when Narayanan and Shmatikov 
(2006) demonstrated that the Netflix data could be de-anonymized. Subsequent works such as 
Calandrino et al. (2011) have reinforced belief in the frailty of these algorithms in the face of pri- 
vacy attacks. 

To design recommender systems in such scenarios, we first need to define what it means for a 
data-release mechanism to be private. The popular perception has coalesced around the following 
notion: a person can either participate in a collaborative filtering system and waive all claims to 
privacy, or avoid such systems entirely. The response of the research community to these concerns 
has been the development of a third paradigm, between complete exposure and complete silence.
This new approach has been captured in the formal notion of differential privacy (see Dwork
(2006)); essentially it suggests that although perfect privacy is impossible, one can control the leakage
of information by deliberately corrupting sensitive data before release. The original definition
in Dwork (2006) provides a statistical test that must be satisfied by a data-release mechanism to be 
private. Accepting this paradigm shifts the focus to designing algorithms that obey this constraint 
while maximizing relevant notions of utility. This trade-off between utility and privacy has been 
explored for several problems in database management (refer to Blum et al. (2005); Dwork (2006); 
Dwork et al. (2006, 2010a,b)) and learning (refer to Blum et al. (2008); Chaudhuri et al. (2011);
Gupta et al. (2011); Kasiviswanathan et al. (2008); McSherry and Mironov (2009); Smith (2011)). 

In the context of recommender systems, there are two models for ensuring privacy: centralized 
and local. Under the centralized model the recommender system is trusted to collect data from 
users; it then responds to queries by publishing results that have been corrupted via any mechanism 
that obeys the differential privacy constraint. However, users increasingly desire control over their 
private data given the mistrust in systems with centrally stored data, misgivings that are supported
by examples such as the Netflix privacy breach. In cases where the database cannot be trusted to 
keep data confidential, users can store their data locally, and differential privacy is ensured through 
suitable randomization at the 'user-end' before releasing data to the recommender system. This is 
precisely the context of the present paper: the design of differentially private algorithms within the 
setting of untrusted recommender systems. 

The latter model is variously known in privacy literature as local differential privacy (see
Kasiviswanathan et al. (2008); we henceforth refer to it as local-DP), and also in statistics as the
'randomized response technique' (see Warner (1965)). However, there are two unique challenges to
local-DP posed by
recommender systems which have not been dealt with before: 

1. The underlying space (here, the set of ratings over all items) has very high dimensionality. 

2. The users have limited information: they rate only a (vanishingly small) fraction of items. 

In this work we address both these issues. Assuming an unknown cluster structure for the items, 
we demonstrate a surprising change in the sample complexity of private learning algorithms when 
shifting from information-rich to information-scarce settings. No similar phenomenon is known 
for non-private learning. With the aid of new information-theoretic arguments, we provide lower 
bounds on the sample complexity in various regimes. On the other hand, these arguments also guide 
us in developing novel algorithms, particularly in the information-scarce setting, which match the 
lower bounds up to logarithmic factors. Thus although we pay a 'price of privacy' when ensuring
local-DP in untrusted recommender systems with information-scarcity, we can design optimal
algorithms under such regimes. 

1.1. Our Results 

We now present a high level view of our technical results, and discuss their relevance to the problem 
of designing algorithms for untrusted recommender systems. As mentioned before, we focus on 
learning a stochastic generative model for the data, under user-end, or local differential privacy 
constraints. This entails a subtle difference in the definition of utility as compared to centralized 
differential privacy. In the latter, the true model may be known to the database curator, but privacy 
constraints require the output to be perturbed; the performance measure is the size of database 
required to output a hypothesis that is private and close to the truth. In contrast, local differential 
privacy ensures privacy at the user-end; the aim of the system is to learn the model from privatized
responses to appropriately designed queries, and the performance is in terms of the number of users
needed for learning. 

More precisely, we aim at learning a partition of the items into clusters within which items are 
statistically identical. The hypothesis class (i.e., set of models) is the set of functions from items [N]
to cluster labels [L] (where typically $L \ll N$), and thus has size $L^N$. Further, we assume that each
user has rated only w items out of the possible N. For a learner to be successful, we require that
it identify the correct cluster label for all items (see footnote 1). Our starting point is then given by the
following basic lower bound (for exact definitions, see Section 2):

Informal Theorem 1 (Theorem 5) For any (finite) hypothesis class $\mathcal{H}$ to be 'successfully' learnt
under $\epsilon$-differential privacy, the number of users must satisfy: $U_{LB} = \Omega\left(\frac{\log |\mathcal{H}|}{\epsilon}\right)$.



The above theorem is based on a standard use of Fano's inequality in statistical learning. Returning 
to the recommender system problem, note that $\log |\mathcal{H}| = \Theta(N)$. In the information-rich setting (i.e.,
where $w = \Omega(N)$), we show the above bound is matched (up to logarithmic factors) by a local-DP
algorithm based on a novel 'pairwise-preference' sketch and spectral clustering techniques: 



Informal Theorem 2 (Theorem 6) In the information-rich (IR) regime, clustering via the Pairwise-
Preference Algorithm succeeds if the number of users exceeds: $U_{PP} = O(N \log N)$.

In practical scenarios w is quite small; for example, in a movie ratings system, users usually have 
seen and rated only a small fraction of the set of movies. Our main results in the paper concern 
non-adaptive, local-DP learning in the information-scarce regime (where w = o(N)). Herein, we 
observe an interesting phase-change in the sample complexity of private learning: 

Informal Theorem 3 In the information-scarce (IS) regime, the sample complexity of non-adaptive, 
local-DP cluster learning is lower bounded by (Theorem 9): U[ B = 0, (^^r^J- Furthermore, for 

small w (in particular, w = o(N3)), we have (Theorem 10): U[ B = $7 ( 



Finally for the IS regime, we develop a new class of algorithms based on a novel sketch, that, under 
certain separation conditions, matches the above lower bound up to logarithmic factors:

Informal Theorem 4 (Theorem 11 ) For a given w, clustering under the MaxSense Algorithm ( Sec- 
tion 5) is successful if the number of users exceeds a threshold given by : Ums = " x 



Techniques: Our main technical contribution lies in the tools we use for the lower bounds. By view- 
ing the privacy mechanism as a noisy channel with certain constraints, we are able to use information 
theoretic methods to obtain bounds on private learning. Although these connections between privacy
and mutual information have been considered in previous works (see McGregor et al. (2010);
Alvim et al. (2011)), our work is novel in that: a) it illustrates its application to problems in private 
learning (via Fano's inequality), and b) it shows how non-trivial bounds can be obtained via care- 
ful analysis of the information leakage in private mechanisms. Towards the latter, we formalize a 
notion of 'channel mis-alignment' between the 'sampling channel' (the partial ratings submitted by 
users) and the privatization channel. In Section 4 we provide a structural lemma (Lemma 7) that 
quantifies this mismatch under general conditions, and demonstrate its use by obtaining tight lower
bounds under 1-bit (non-private) sketches. In Section 4.2 we use it to obtain tight lower bounds
under local-DP. In Section 6 we discuss its application to adaptive local-DP algorithms, establishing
a lower bound of order $\Omega(N \log N)$, which also refines Informal Theorem 1. Though we focus on
the item clustering problem, the lower bounds thus obtained apply to learning any finite hypothesis
class under privacy constraints, and offer scope for further extensions.

1. In Appendix A we also treat the case where we allow a fraction of item misclassifications.

The information theoretic results also suggest that 1-bit privatized sketches are sufficient for 
learning in such scenarios. Based on this intuition, we show how existing spectral-clustering tech- 
niques can be extended to private learning in some regimes. More significantly, in the information- 
scarce regime, where spectral learning fails, we develop a novel algorithm based on blind probing 
of a large set of items. This algorithm, in addition to being private and having optimal sample com- 
plexity in many regimes, triggers several interesting open questions, which we discuss in Section 6. 

1.2. Related Work 

Privacy preserving recommender systems: The design of recommender systems with differential 
privacy was studied by McSherry and Mironov (2009) under the centralized model. Like us, they 
separate the recommender system into two components, a learning phase (based on a database 
appropriately perturbed to ensure privacy) and a recommendation phase (performed by the users 
'at home', without interacting with the system). They numerically compare the performance of 
the algorithm against non-private algorithms. In contrast, we consider a stronger notion of privacy 
(local-DP), and for our generative model, are able to provide tight analytical guarantees and further, 
quantify the impact of limited information on privacy. 

Private PAC Learning and Query Release: Several works have considered private algorithms 
for PAC-learning. Blum et al. (2008); Gupta et al. (2011) consider the private query release prob- 
lem (i.e., releasing approximate values for all queries in a given class) in the centralized model. 
Kasiviswanathan et al. (2008) show equivalences between: a) centralized private learning and ag- 
nostic PAC learning, b) local-DP and the statistical query (SQ) model of learning; this line of work 
is further extended by Beimel et al. (2010). Although some of our results (in particular, Theorem 
5) are similar in spirit to lower bounds for PAC (see Kasiviswanathan et al. (2008); Beimel et al.
(2010)), there are significant differences both in scope and technique. Furthermore:

1. We emphasize the importance of limited information, and characterize its impact on learning 
with local-DP. Hitherto unconsidered, information scarcity is prevalent in practical scenarios,
and as our results show, it has strong implications on learning performance under local-DP.

2. Via lower bounds, we provide a tight characterization of sample complexity, unlike Kasiviswanathan et al. 
(2008); Blum et al. (2008); Gupta et al. (2011), which are concerned with showing polyno- 
mial bounds. This is important for high dimensional data sets. 

Privacy in Statistical Learning: Chaudhuri et al. (2011) consider privacy in the context of em- 
pirical risk minimization; they analyze the release of classifiers, obtained via algorithms such as 
SVMs, with (centralized) privacy constraints on the training data. Though they provide performance 
guarantees, they do not provide related lower bounds. Dwork and Lei (2009) study algorithms 
for privacy-preserving regression under the centralized model; these however require running time 
which is exponential in the data dimension. Smith (2011) obtains private, asymptotically-optimal 
algorithms for statistical estimation, again though, in the centralized model. 

Other Notions of Privacy: The local-DP model which we consider has been studied before in pri- 
vacy literature (Kasiviswanathan et al. (2008); Dwork et al. (2006)) and statistics (Warner (1965)).
It is a stronger notion than central differential privacy, and also stronger than two other related no-
tions: pan-privacy (Dwork et al. (2010b)) where the database has to also deal with occasional release 
of its state, and privacy under continual observations (Dwork et al. (2010a)), where the database 
must deal with additions and deletions, while maintaining privacy. 

Recommendation algorithms based on incoherence properties: Apart from privacy-preserving
algorithms, there is a large body of work on designing recommender systems under various con- 
straints (usually low-rank) on the ratings matrix (for example, Wainwright (2009); Keshavan et al. 
(2010)). These methods, though robust, fail in the presence of privacy constraints, as the noise 
added as a result of privatization is much more than their noise-tolerance. This is intuitive, as suc- 
cessful matrix completion would constitute a breach of privacy; our work builds the case for using 
simpler lower dimensional representations of the data, and simpler algorithms based on extracting 
limited information (in our case, 1-bit sketches) from each user. 

2. Preliminaries 

We now define our system model, the notion of differential privacy, and tools from information 
theory that form the basis of our techniques. We use [N] to denote the set {1, 2, ... , N}. 

2.1. Recommender Systems 

In this paper we consider a specific statistical model wherein items are assumed to have an underly- 
ing cluster structure, and user affinities for items depend only on the clusters they belong to. In this 
setting, the primary objective of the recommender engine is to learn these clusters (and then reveal 
them to the users, who can then compute their own recommendations privately). Our model, though 
simpler than the state of the art in recommender engines, is still rich enough to account for many of 
the features seen empirically in recommender systems. In addition it yields reasonable accuracy in 
non-private settings on meaningful datasets (see Tomozei and Massoulie (2011)). 

We thus assume that there is an underlying clustering of users and items into several classes, 
such that the affinity of a user for an item is only a function of the user's class and the item's class 
(this is akin to a bipartite version of the Stochastic Blockmodel of Holland et al. (1983), widely used 
in model selection literature). Let [U] be the set of U users and [N] the set of N items. The set of 
users is divided into K clusters labelled as $C_U = \{1, 2, \ldots, K\}$, where cluster $i$ contains $\alpha_i U$ users.
Similarly, the set of items is divided into L clusters $C_N = \{1, 2, \ldots, L\}$, where cluster $\ell$ contains
$\beta_\ell N$ items. We use A to denote the matrix of user/item ratings, where each row corresponds to a
user, and each column an item. For simplicity, we assume $A_{ij} \in \{0, 1\}$; for example, this could
correspond to 'like/dislike' ratings. Finally we have the following statistical model for the ratings:
for user $u \in [U]$ with user class $k$, and item $n \in [N]$ with item class $\ell$, the rating $A_{un}$ is given by a
Bernoulli random variable $A_{un} \sim \text{Bernoulli}(b_{k\ell})$, where the ratings by users in the same class, and
for items in the same class, are i.i.d.

In order to model limited information, i.e., the fact that users rate only a fraction of all items,
we define a parameter w to be the number of items a user has rated (more generally, we only need
bounds for w; for example, we could have $w = \Omega(f(N))$ for some function $f$). We assume that the
rated items are picked uniformly at random. We characterize $w = \Omega(N)$ as the information-rich
regime and $w = o(N)$ as the information-scarce regime.
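
The generative model above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the function name `sample_ratings`, the toy cluster assignments, and the affinity matrix `b` are all assumptions made for illustration. Each user rates w items chosen uniformly at random, and each rating is a Bernoulli draw whose parameter depends only on the user class and the item class.

```python
import random

def sample_ratings(user_classes, item_classes, b, w, rng):
    """Return one dict {item index: 0/1 rating} per user.

    Each user rates w items chosen uniformly at random without replacement;
    a rating is 1 with probability b[user class][item class].
    """
    n_items = len(item_classes)
    observed = []
    for k in user_classes:
        rated = rng.sample(range(n_items), w)  # uniform size-w subset of items
        ratings = {n: int(rng.random() < b[k][item_classes[n]]) for n in rated}
        observed.append(ratings)
    return observed

rng = random.Random(0)
item_classes = [0, 0, 0, 1, 1, 1]   # hypothetical: L = 2 item clusters, N = 6
user_classes = [0, 1, 0, 1]         # hypothetical: K = 2 user clusters, U = 4
b = [[0.9, 0.1], [0.2, 0.8]]        # hypothetical affinities b[k][l]
data = sample_ratings(user_classes, item_classes, b, w=3, rng=rng)
```

Setting w close to the number of items mimics the information-rich regime, while a small w relative to N mimics the information-scarce regime.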

When considering lower bounds, we will specialize this model to the situation where there is 
only one user class (K = 1) and where users have perfect knowledge of the type of the items they 
rate.

2.2. Differential Privacy

Differential privacy is a framework that, in its most general form, defines conditions under which 
an algorithm can be said to be privacy preserving with respect to the input. Formally we have: 

Definition 1 ($\epsilon$-Differential Privacy) A randomized function $\Phi : \mathcal{X} \to \mathcal{Y}$ that maps data $X \in \mathcal{X}$
to $Y \in \mathcal{Y}$ is said to be $\epsilon$-differentially private (or $\epsilon$-DP) if, for all values $y \in \mathcal{Y}$ in the range space
of $\Phi$, and for all 'neighboring' data $x, x'$, we have that:

$$P[Y = y \mid X = x] \leq e^{\epsilon} \, P[Y = y \mid X = x']. \qquad (1)$$

We also assume that Y conditioned on X is independent of any external side information Z (in other
words, the output of mechanism $\Phi$ depends only on X and its internal randomness). Furthermore,
the definition of 'neighboring' is chosen according to the situation, and determines the data that
remain private. In the context of ratings matrices, two matrices can be neighbors if: i) they differ in
a single row (per-user privacy), or ii) if they differ in a single rating (per-rating privacy).

We consider the local model of differential privacy, where privacy is ensured at the user-database
boundary before the data is stored in the system. This paradigm is known in statistics as the
'Randomized Response' technique (Warner (1965)), where it is used for collecting statistics for sensitive
questions. For each user u, let X be its private data (in the recommendation context, the rated-item
labels and corresponding ratings) and let Y be the data that the user makes publicly available to the
untrusted engine. Then local-DP requires that the above condition holds, where any two private
data (ratings vectors in our case) x and x' are deemed neighboring. It is thus the natural notion of
privacy in the case of untrusted databases, as the data is privatized at the user-end before storage in
the database; to emphasize this, we alternately refer to it as User-end Differential Privacy.

We conclude this section with a mechanism for releasing a single bit under $\epsilon$-differential privacy.
The proof of differential privacy for this mechanism is easy to check using equation (1).

Proposition 2 ($\epsilon$-DP bit release): Given a single bit $S^0$, let output bit $S$ be equal to $S^0$ with
probability $\frac{e^{\epsilon}}{1 + e^{\epsilon}}$, else equal to $\bar{S}^0 = 1 - S^0$. Then the map $S^0 \to S$ is (locally) $\epsilon$-differentially private.

2.3. Preliminaries from Information Theory

For a random variable X taking values in some discrete space $\mathcal{X}$, its entropy is defined as
$H(X) = \sum_{x \in \mathcal{X}} -P[X = x] \log P[X = x]$ (see footnote 2). For two random variables X, Y, the mutual
information between them is given by:

$$I(X; Y) = \sum_{x, y} P[X = x, Y = y] \log \frac{P[X = x, Y = y]}{P[X = x] \, P[Y = y]}.$$

Our main tools for constructing lower bounds are variants of Fano's Inequality, which are
commonly used in non-parametric statistics literature (see Santhanam and Wainwright (2009);
Wainwright (2009)). Consider a finite hypothesis class $\mathcal{H}$, $|\mathcal{H}| = M$, indexed by [M]. Suppose
that we choose a hypothesis H uniformly at random from {1, 2, ..., M}, sample a data set $X_1^U$
of U samples drawn in an i.i.d. manner according to a distribution $P_X(H)$ (in our case, $u \in [U]$
corresponds to a user, and $X_u$ the ratings drawn according to the statistical model in Section 2.1),
and then provide a private version of this data $\hat{X}_1^U$ to the learning algorithm. We can represent this
as the Markov chain:

$$H \in \mathcal{H} \xrightarrow{\text{Sampling}} X_1^U \xrightarrow{\text{Privatization}} \hat{X}_1^U \xrightarrow{\text{Model Selection}} \hat{H}$$

2. For notational convenience, we use log(·) as the logarithm to the base 2 throughout; hence, the entropy is in 'bits'.

Further, we define a given learning algorithm to be unreliable for the hypothesis class $\mathcal{H}$ (and a
hypothesis drawn uniformly at random) if $\max_{h \in \mathcal{H}} P[\hat{H} \neq H \mid H = h] > \frac{1}{2}$.

Fano's inequality provides a lower bound on the probability of error under any learning al- 
gorithm in terms of the mutual information between the underlying hypotheses and the samples. 
A basic version of the inequality is as follows (see Appendix A for a more general version with 
discussions): 

Lemma 3 (Fano's Inequality) Given a hypothesis H drawn uniformly from $\mathcal{H}$, and U samples $X_1^U$
drawn according to H, for any learning algorithm, the average probability of error $P_e = P[\hat{H} \neq H]$
satisfies:

$$P_e \geq 1 - \frac{I(H; X_1^U) + 1}{\log(M)}. \qquad (2)$$

As a direct consequence of this result, if the samples are such that $I(H; X_1^U) = o(\log M)$, then any
algorithm fails to correctly identify almost all of the possible underlying models. Though this is a
weak bound, equation (2) turns out to be sufficient to study sample complexity scaling in the cases
we consider. In Appendix A, we consider stronger versions, as well as more general criteria for
approximate model selection (i.e., with distortion).
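
To make the use of Fano's inequality concrete, the following minimal sketch evaluates the error lower bound numerically. It assumes the standard base-2 form of the inequality, $P_e \geq 1 - (I + 1)/\log_2 M$; the function name and the example numbers are illustrative only.

```python
import math

def fano_error_lower_bound(mutual_info_bits, num_hypotheses):
    """Lower bound on the average error probability from Fano's inequality.

    mutual_info_bits: mutual information (in bits) between hypothesis and samples.
    num_hypotheses:   size M of the finite hypothesis class.
    The bound is clipped at 0, since a negative bound carries no information.
    """
    return max(0.0, 1.0 - (mutual_info_bits + 1.0) / math.log2(num_hypotheses))

# Two-class clustering of N = 20 items: M = 2^20, so log2(M) = 20 bits.
# With only 4 bits of information, the bound is 1 - (4 + 1) / 20 = 0.75.
bound = fano_error_lower_bound(4.0, 2 ** 20)
```

The bound becomes vacuous once the mutual information approaches $\log_2 M$, which is why the lower bounds in this paper reduce to upper-bounding the information that the samples carry about the hypothesis.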

3. Clustering under Local-DP: The Information-Rich Regime 

In this section, we derive a lower bound on the number of users needed for accurate learning under 
local differential privacy. This relies on a simple bound on the mutual information between any 
database and its privatized output, and hence is applicable in very general settings. Returning to 
the clustering problem, we give an algorithm that matches the optimal scaling in N (up to some 
logarithmic factor) under one of the following two conditions: i) $w = \Omega(N)$, i.e., each user has
rated a constant fraction of items (the information-rich regime), or ii) only the ratings are private, 
not the identity of the rated items. 

We obtain a simple lower bound on the scaling required using the following lemma that charac- 
terizes a lower bound on the mutual information leakage across any differentially private channel. 
Equivalent statements of this lemma are given in Alvim et al. (2011); McGregor et al. (2010):

Lemma 4 Given (private data) r.v. $X \in \mathcal{X}$, a privatized output $Y \in \mathcal{Y}$ obtained by any locally
$\epsilon$-DP mechanism $\Phi : \mathcal{X} \to \mathcal{Y}$, and any side information Z, we have: $I(X; Y \mid Z) \leq \epsilon \log e$.
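
As a numerical sanity check of Lemma 4 (not a proof), the sketch below computes the mutual information across a single-bit randomized-response channel and compares it with $\epsilon \log_2 e$. It assumes a uniformly distributed input bit and a mechanism that keeps the bit with probability $e^{\epsilon}/(1 + e^{\epsilon})$; the function names are illustrative.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def bit_release_mutual_info(eps):
    """I(S0; S) in bits for a uniform input bit S0 and a channel that
    keeps the bit with probability e^eps / (1 + e^eps)."""
    keep_prob = math.exp(eps) / (1.0 + math.exp(eps))
    # For a binary symmetric channel with uniform input: I = 1 - h(keep_prob).
    return 1.0 - binary_entropy(keep_prob)

for eps in (0.1, 0.5, 1.0, 2.0):
    leaked = bit_release_mutual_info(eps)
    ceiling = eps * math.log2(math.e)
    assert leaked <= ceiling  # consistent with the bound of Lemma 4
```

For small $\epsilon$ the leaked information is far below the ceiling, which reflects that the lemma is a worst-case bound over all mechanisms and input distributions.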

Lemma 4 follows directly from the definitions of mutual information and differential privacy 
(note that for any such mechanism, the output Y given the input X is conditionally independent of 
any side-information). It suggests that any mechanism obeying DP results in an output which has
at most $\epsilon \log e$ bits of information vis-a-vis the data. Returning to the private learning of item classes,
we obtain a lower bound on the number of users needed by considering the following reduction:
let $C_N \in \mathcal{H} = \{0, 1\}^N$ be the mapping of the item set [N] to two classes represented as {0, 1};
hence the size of the hypothesis class is now $2^N$. Recall that we defined a learning algorithm to be
unreliable for $\mathcal{H}$ if $\max_{h \in \mathcal{H}} P[\hat{C}_N \neq C_N \mid C_N = h] > \frac{1}{2}$. Using Lemma 4 and Fano's inequality
(Lemma 3), we get the following lower bound on the sample complexity.
Theorem 5 Suppose the underlying clustering $C_N$ is drawn uniformly at random from $\{0, 1\}^N$.
Then any learning algorithm obeying $\epsilon$-local-DP is unreliable if the number of queries satisfies $U < \frac{N}{\epsilon \log e}$.

Proof (Sketch) Similar to the assumptions in Section 2.3, we have the following model for each
user (under local-DP): $C_N \xrightarrow{\text{Sampling}} X_u \xrightarrow{\text{Privatization}} \hat{X}_u$, for each $u \in [U]$. Here sampling
refers to each user rating a subset of w items. Now by the Data-Processing Inequality (Theorem
2.8.1 from Cover and Thomas (2006), see Appendix A), we have that

$$I(C_N; \hat{X}_1^U) \leq \sum_{u=1}^{U} I(X_u; \hat{X}_u \mid \hat{X}_1^{u-1}) \leq U \epsilon \log e,$$

by Lemma 4. Fano's inequality (Lemma 3) then implies that a learning algorithm is unreliable if
the number of queries satisfies: $U < \frac{N}{\epsilon \log e}$.

We note that this is a simplified form of a more general theorem presented in Appendix A. Further, 
though this bound is not the strongest possible, it turns out to be achievable (up to logarithmic fac- 
tors) in the information-rich regime, as we show below. A similar bound was given by Beimel et al. 
(2010) for PAC-learning in the centralized setting using more explicit counting techniques. Such 
bounds fail to exhibit the correct scaling in the information-scarce setting (w = o(N)). The
reason for this is that we use the Data-Processing inequality in the proof of Theorem 5, instead 
of jointly analyzing the interaction between the two channels (sampling and privatization). How- 
ever, unlike proofs based on simple counting arguments, our method allows us to leverage more 
sophisticated information theoretic tools for other variants of the problem, like those we consider 
subsequently in Section 4. 
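
The chain of inequalities behind Theorem 5 can be written out as a short worked derivation. This is a sketch that assumes the base-2 form of Fano's inequality, $P_e \geq 1 - (I + 1)/\log M$, with $\log M = N$ for the two-class hypothesis set; the target error level $\delta \in (0, 1)$ is introduced here only for illustration.

```latex
\begin{align*}
I(C_N; \hat{X}_1^U) &\leq U \epsilon \log e
  && \text{(data processing and Lemma 4)} \\
P_e &\geq 1 - \frac{U \epsilon \log e + 1}{N}
  && \text{(Fano's inequality with } \log M = N\text{)} \\
P_e \leq \delta \;&\Longrightarrow\; U \geq \frac{(1 - \delta) N - 1}{\epsilon \log e}
  && \text{(rearranging)}
\end{align*}
```

Under these assumptions, driving the average error below any fixed $\delta < 1$ requires $U = \Omega(N / \epsilon)$ users, whatever the value of w.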

To conclude this section, we outline an algorithm for clustering in the information-rich regime. 
The algorithm proceeds as follows: i) provide each user u with two items $(i_u, j_u)$ picked at random
whereupon the user generates a private bit $S_u^0$ equal to 1 if it rated the two items positively, and else
0, ii) let users release as a public sketch a privatized version $S_u$ of their private bit using the $\epsilon$-DP
bit release mechanism, iii) construct matrix A whose (i, j) entry is obtained by adding the sketches
$S_u$ of each user u queried with item pair (i, j), and iv) perform spectral clustering of items based
on matrix A and return the item classes. We refer to this as the Pairwise-Preference algorithm.
The algorithm is formally specified in Appendix B. Its privacy is guaranteed by the use of e-DP bit 
release, while its performance analysis, given in Theorem 6, is based on a related result on spectral 
clustering by Tomozei and Massoulie (2011); details are given in Appendix B (in particular, the 
detailed separability conditions are given in Theorem 21). 
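
Steps i) to iii) of the algorithm outlined above can be sketched as follows. This is an illustrative sketch rather than the formal specification of Appendix B: the function names and the toy data are assumptions, the bit release is assumed to keep the private bit with probability $e^{\epsilon}/(1 + e^{\epsilon})$, and the spectral clustering of step iv) is deliberately left out.

```python
import math
import random

def dp_bit_release(bit, eps, rng):
    """Keep the bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep_prob = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < keep_prob else 1 - bit

def collect_pairwise_sketches(user_ratings, n_items, eps, rng):
    """Steps i)-iii): one random item pair and one privatized bit per user.

    user_ratings: list of dicts {item index: 0/1 rating}; unrated items are absent.
    Returns a symmetric n_items x n_items matrix of summed privatized bits.
    """
    agg = [[0] * n_items for _ in range(n_items)]
    for ratings in user_ratings:
        i, j = rng.sample(range(n_items), 2)            # step i: random pair
        private_bit = int(ratings.get(i, 0) == 1 and ratings.get(j, 0) == 1)
        sketch = dp_bit_release(private_bit, eps, rng)  # step ii: privatize
        agg[i][j] += sketch                             # step iii: aggregate
        agg[j][i] += sketch
    return agg

rng = random.Random(0)
users = [{0: 1, 1: 1, 2: 0, 3: 0} for _ in range(200)]  # hypothetical toy ratings
matrix = collect_pairwise_sketches(users, n_items=4, eps=1.0, rng=rng)
```

Only the privatized bit leaves the user, so the privacy guarantee rests entirely on the bit release; the aggregation and any later clustering are post-processing of already private data.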

Theorem 6 The Pairwise-Preference algorithm is $\epsilon$-differentially private. Further, in the information-
rich regime, under the separability assumptions on the model parameters $(\alpha_k)$, $(\beta_\ell)$ and $(b_{k\ell})$ stated
in Appendix B, there exists c > 0 such that the item clustering is successful with high probability if
the number of users satisfies: $U > c (N \log N)$.

4. The Information-scarce Setting: Lower Bounds 

To get tighter lower bounds on the number of users needed to obtain an accurate item clustering, 
we need more accurate bounds on the mutual information between the underlying model expressed 
in terms of item clusters and the available privatized data. In Section 3 we developed a basic lower
bound by characterizing a constraint on the mutual information across any differentially private
channel. We now develop some more refined techniques to study the impact of privatization in the 
presence of incomplete information. 

As in the previous lower bound, we consider a simplified version of the problem, where there 
is a single class of users, and each item is ranked either 0 or 1 deterministically by each user (i.e.,
$b_{ui} = b_i \in \{0, 1\}$ for all items). Let $C_N(\cdot) : [N] \to \{0, 1\}$ be the underlying clustering function; in
general we can think of this as an N-bit vector $Z \in \{0, 1\}^N$. We assume that the user-data for user
u is given by $X_u = (I_u, Z_u)$, where $I_u$ is a size w subset of [N] representing items rated by user
u, and $Z_u$ are the ratings for the corresponding items; in this case, $Z_u = \{Z(i)\}_{i \in I_u}$. The set $I_u$ is
assumed to be chosen uniformly at random from amongst all size-w subsets of [N]. We also denote
the privatized sketch from user u as $S_u \in \mathcal{S}$. Here the space $\mathcal{S}$ to which sketches belong is assumed
to be an arbitrary finite or countably infinite space. The sketch is assumed $\epsilon$-differentially private.
Finally, as before, we assume that Z is chosen uniformly over $\{0, 1\}^N$.

4.1. Local Differential Privacy and Mutual Information 

In this section, we establish the main lemma we use for bounding the mutual information under 
local-DP, and derive a result for the sample complexity of learning with 1-bit sketches, which builds 
intuition regarding the bounds in the next section. 

We define $\binom{[N]}{w}$ to be the collection of all size-w subsets of [N] = {1, 2, ..., N},
$\mathcal{D} = \binom{[N]}{w} \times \{0, 1\}^w$ to be the set from which user information (i.e., (I, Z)) is drawn, and define
$D = |\mathcal{D}| = \binom{N}{w} 2^w$. Finally $E_X[\cdot]$ indicates that the expectation is over the random variable X. We now establish
the following bound for the mutual information between the model and the sketch. This is a special
case (for Z taking the uniform measure over $\{0, 1\}^N$) of a more general lemma which we state and
prove in Appendix C.

Lemma 7 Given the Markov Chain $Z \to (I, Z) \to S$, let $(I_1, Z_1), (I_2, Z_2) \in \mathcal{D}$ be two pairs
of 'user-data' sets which are independent and identically distributed according to the conditional
distribution of the pair (I, Z) given S = s. Then, the mutual information $I(Z; S)$ satisfies:

$$I(Z; S) \leq E_S \left[ E_{(I_1, Z_1) \mid S, \, (I_2, Z_2) \mid S} \left[ 2^{|I_1 \cap I_2|} \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right] \right],$$

where we use the notation $\mathbb{1}_{\{Z_1 \equiv Z_2\}}$ to denote that the two user-data sets are consistent on the index
set on which they overlap, i.e., $\mathbb{1}_{\{Z_1 \equiv Z_2\}} = \mathbb{1}_{\{Z_1(i) = Z_2(i) \, \forall i \in I_1 \cap I_2\}}$.
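
The quantity inside the expectation in Lemma 7 depends only on how two user-data sets overlap. The sketch below evaluates it for a single pair, assuming the bracketed term has the form $2^{|I_1 \cap I_2|} \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1$; the function name and the dict-based representation of ratings are illustrative choices.

```python
def lemma7_term(items1, ratings1, items2, ratings2):
    """Value of 2^{|I1 & I2|} * 1{Z1 consistent with Z2} - 1 for one pair.

    items1, items2:     iterables of item indices (the sets I1 and I2).
    ratings1, ratings2: dicts {item index: 0/1 rating} for those items.
    """
    overlap = set(items1) & set(items2)
    consistent = all(ratings1[i] == ratings2[i] for i in overlap)
    return (2 ** len(overlap)) * int(consistent) - 1

# One shared item with matching ratings: 2^1 * 1 - 1 = 1.
agree = lemma7_term({1, 2, 3}, {1: 0, 2: 1, 3: 1}, {3, 4, 5}, {3: 1, 4: 0, 5: 0})
# No shared items: 2^0 * 1 - 1 = 0, so disjoint pairs contribute nothing.
disjoint = lemma7_term({1, 2}, {1: 0, 2: 1}, {3, 4}, {3: 1, 4: 0})
# One shared item with conflicting ratings: 2^1 * 0 - 1 = -1.
clash = lemma7_term({1, 2, 3}, {1: 0, 2: 1, 3: 0}, {3, 4, 5}, {3: 1, 4: 0, 5: 0})
```

Read this way, the bound is driven by pairs that overlap and agree, which become rare when w is small relative to N.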

Before deriving tighter lower bounds under local-DP, we first consider a related problem that 
demonstrates the effect of per-user constraints (as opposed to average constraints) on the mutual 
information. We consider the same item-class learning problem as before with w = 1 (i.e., each 
user has access to one rating), but instead of a privacy constraint, we consider a 'per-user bandwidth' 
constraint, wherein each user can communicate only a single bit to the learning algorithm. 

This demonstrates an interesting change in the sample complexity of learning with per-user 
communications constraints (maximum bandwidth in this section, and privacy in next section) ver- 
sus average-user constraints (mutual information bound or average bandwidth). In the former case 
as we will show, the sample complexity is $\Theta(N^2)$. In the latter case, the sample complexity with
1-bit average bandwidth constraint is $O(N \log^2 N)$. Indeed, assume $w = 1$, and let users reveal
their private data $(I, Z)$ with probability $1/\log(N)$ and otherwise return a blank symbol. Then
the average information released per user is $O(1)$, and by a coupon collector argument the original
sequence $Z$ is indeed retrieved after $O(N \log^2 N)$ queries.

Theorem 8 Suppose $w = 1$, with $(I, Z)$ drawn i.i.d. uniformly over $[N] \times \{0, 1\}$. Then for any
1-bit sketch derived from $(I, Z)$, it holds that $I(Z; S) = O(1/N)$, and consequently, there exists
a constant $c > 0$ such that any cluster learning algorithm using queries with 1-bit responses is
unreliable if the number of users satisfies $U < cN^2$.

Proof (Sketch.) We first note that $I(Z; S)$ is a convex function of $P[S = s | Z = z]$ for fixed
$P[Z = z]$ (Theorem 2.7.4, Cover and Thomas (2006)). Thus, the mutual information is maximized
at the extremal points of the kernel $P[S = s | Z = z]$, which correspond to $P[S = s | (i, z)] \in
\{0, 1\}$, implying that the class of deterministic queries with 1-bit response that maximizes mutual
information has the following structure: given user-data $(I_u, Z_u)$, user $u$'s response $S_u \in \{0, 1\}$
is of the form $S_u = \mathbb{1}_A(I_u, Z_u)$, where $A \subseteq \{(i, z) : i \in [N], z \in \{0, 1\}\}$. In other words, the
algorithm provides user $u$ with an arbitrary set $A$ of (item, rating) pairs, and the user reports whether
$(I_u, Z_u)$ is contained in $A$. The mutual information bound then follows from Lemma 7 and elementary
manipulations, and the result follows from Fano's inequality (Lemma 3). ■

We note that this is a tight bound: a simple (adaptive) scheme is to ask random queries of the form
"Is $(I, Z) = (i, b)$?" (where $i \in [N]$ and $b \in \{0, 1\}$). The average time between two successful
queries is $2N$, and one needs $N$ successful queries to learn all the bits.
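
The scheme above can be simulated in a few lines. The Python sketch below is only an illustration under stated assumptions: each queried user holds one uniformly chosen item and its true rating, the learner targets items one at a time, and the guessed bit $b$ is drawn uniformly. The function name and the sequential ordering of target items are hypothetical choices, not taken from the paper.

```python
import random


def learn_bits_with_one_bit_queries(z, rng):
    """Recover the bit vector z using yes/no queries "Is (I, Z) = (i, b)?".

    Each query goes to a fresh user whose single rated item is uniform over
    range(len(z)). A query succeeds only if the user holds item i and the
    guessed bit b matches, which happens with probability 1 / (2N).
    Returns the learned bits and the number of queries used.
    """
    N = len(z)
    learned = [None] * N
    queries = 0
    for i in range(N):
        while learned[i] is None:
            b = rng.randrange(2)          # guessed rating for item i
            user_item = rng.randrange(N)  # item held by the queried user
            queries += 1
            if user_item == i and z[user_item] == b:
                learned[i] = b            # a "yes" answer reveals Z(i) = b
    return learned, queries
```

Since each query succeeds with probability $1/(2N)$ and $N$ successes are needed, the expected number of queries in this simulation is $2N^2$, in line with the $N^2$ scaling of Theorem 8.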

4.2. Query Complexity Lower Bounds for Clustering under Local-DP 

We now exploit the above techniques to obtain lower bounds on the scaling required for accurate 
clustering with DP in an information-scarce regime, i.e., when w = o(N). We first obtain a weak 
lower bound in Theorem 9, valid for all w, and then refine it in Theorem 10 under some additional 
conditions. Refer to Appendix C for the complete proofs. 

Theorem 9 In the information-scarce regime, i.e., when $w = o(N)$, under $\epsilon$-local-DP we have:

\[ I(Z; S) = O\left( \frac{w^2}{N} \right), \]

and consequently, there exists a constant $c > 0$ such that any cluster learning algorithm with $\epsilon$-local-DP
is unreliable if the number of users satisfies $U < c \left( \frac{N^2}{w^2} \right)$.

The above result shows how Lemma 7 can be used to obtain sharper bounds on the mutual information
contained in a differentially private sketch in the information-scarce setting in comparison
to Lemma 4. Using this result, we get a lower bound of $\Omega(N^2/w^2)$ on the number of samples needed
to learn the underlying clustering. We now present a tighter bound on the mutual information under
some conditions; it relies on a more careful evaluation of the bound in Lemma 7, but matches the
performance of the algorithm we present in Section 5, thereby displaying its optimality.

Theorem 10 Under the scaling assumption $w = o(N^{1/3})$, and for $\epsilon < \ln(2)$, it holds that

\[ I(Z; S) = O\left( \frac{w}{N} \right), \tag{3} \]

and thus there exists a constant $c > 0$ such that any cluster learning algorithm with $\epsilon$-local-DP is
unreliable if the number of users satisfies $U < c \left( \frac{N^2}{w} \right)$.

5. The Information-scarce Setting: Cluster Learning

The sample complexity of the pairwise-preference algorithm in Section 3 does not match our lower 
bounds in an information-scarce setting. Indeed, the probability that two randomly probed items 
belong to the rated set of size $w$ is $O(w^2/N^2)$. The sample complexity is thus magnified from
$\Omega(N \log(N))$ in the information-rich regime to $O(N^3 \log(N)/w^2)$, which is polynomially larger
than our lower bound for $w = o(\sqrt{N})$. We thus turn to the design of a new algorithm that achieves
the sample complexity bound from Section 4. 

The MaxSense algorithm: As in Pairwise Preference, we use a (privatized) 1-bit sketch for learning.
A query to user $u$ is formed by first constructing a random sensing vector $H_u = (H_{un})_{n \in [N]}$,
whose entries are $H_{un} = 1$ if item $n$ is being sensed, and $0$ otherwise; each entry is set to 1 in an i.i.d.
manner with probability $\theta/w$ for some design parameter $\theta$. User $u$ then constructs a private sketch
$S_u^0$, which is the disjunction of its ratings for all items $n$ that are being sensed (with unrated items
given a rating of 0): $S_u^0 = \max_{n \in [N]} H_{un} Z_{un}$, where $Z_{un} \in \{0, 1\}$ equals 1 if user $u$ rated item $n$
positively. Finally, user $u$ outputs a privatized version $S_u$ of its private sketch $S_u^0$. The sensing vector
$H_u$ is known publicly, and hence can be generated either by the user or by the engine querying the user.

Based on the sketches $S_u$ and sensing vectors $H_u$, the algorithm then determines per-item scores
$X_n$ according to $X_n := \sum_{u \in [U]} H_{un} S_u$, $n \in [N]$, and performs $k$-means clustering of these scores
in $\mathbb{R}$. A formal description of the algorithm is provided in Appendix D. Now we have the following:
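
The user-side and engine-side steps described above can be sketched as follows. This is a minimal illustration, not the formal algorithm of Appendix D: the privatization step is assumed to be randomized response that keeps the private bit with probability $e^{\epsilon}/(1 + e^{\epsilon})$ (consistent with the bit-release expression used in Appendix B), items are indexed from 0, and the final $k$-means step on the scores is omitted. All function names are hypothetical.

```python
import math
import random


def keep_probability(eps):
    """Probability that the released bit equals the private bit (assumed mechanism)."""
    return math.exp(eps) / (1.0 + math.exp(eps))


def dp_bit_release(bit, eps, rng):
    """Randomized response: keep the bit w.p. e^eps / (1 + e^eps), else flip it."""
    return bit if rng.random() < keep_probability(eps) else 1 - bit


def maxsense_sketch(ratings, N, w, theta, rng):
    """Return (H, S0): the sensing vector and the private, unprivatized sketch.

    `ratings` maps each rated item (0-based) to its rating in {0, 1};
    unrated items are treated as having rating 0.
    """
    p = min(1.0, theta / w)  # per-item sensing probability theta / w
    H = [1 if rng.random() < p else 0 for _ in range(N)]
    S0 = max(H[n] * ratings.get(n, 0) for n in range(N))
    return H, S0


def item_scores(sensing_vectors, sketches):
    """Per-item scores X_n = sum over users u of H_un * S_u."""
    N = len(sensing_vectors[0])
    return [sum(H[n] * S for H, S in zip(sensing_vectors, sketches))
            for n in range(N)]
```

In use, each user would call `maxsense_sketch` on its own ratings, pass the result through `dp_bit_release`, and send back only the privatized bit; the engine then accumulates `item_scores` over all users before clustering the scores on the real line.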

Theorem 11 The MaxSense algorithm is $\epsilon$-differentially private. Further, define

K 



^ 2(e e - 1) 
e = -r— — -7T-, <W = mm 

(e e + 1) 1<£<£'<L 



J2 ake -ej: i=1 Pib k ^ bkt _ bui 

k=l 



where $\theta$ is the parameter of the item sensing probability $\theta/w$. Then for any $d > 0$, there exists a
constant $C > 0$ such that the clustering is successful with probability $1 - N^{-d}$ if the number of
users satisfies:



linn 



W 



$\delta_{\min}$ here determines separability conditions on the problem: for example, using the notation
$v_k := \sum_{\ell} \beta_{\ell} b_{k\ell}$, it can be checked that $\delta_{\min}$ is strictly positive for all $\theta$ (except on a set of measure
0) provided the following condition holds:

\[ \forall\, \ell \ne \ell' \in [L],\ \exists\, k \in [K] \text{ such that } \sum_{j : v_j = v_k} \alpha_j \left( b_{j\ell} - b_{j\ell'} \right) \ne 0. \tag{4} \]

Determining whether alternative schemes could achieve similar complexity under weaker separa- 
bility conditions is, for now, an open problem. 

6. Extensions and Conclusion 

Theorem 11 demonstrates that despite its simplicity, MaxSense is sufficient to achieve optimal scaling
in $N$ (up to logarithmic terms) under a suitable separability condition. More generally, the algorithm
suggests a general approach to dealing with partial information under local differential
privacy. We now briefly discuss an extension to achieve a better 'privacy trade-off, namely, a 
- factor in the scaling required for accurate clustering. Under this extension, each user is asked 
Q = [e _1 ] MaxSense questions, each with a privacy parameter of ^ in a way that ensures independence
between answers. The user calculates Q sketches using the Q sensing vectors and reveals
the privatized set of sketches (with each sketch being revealed via a <g-DP bit release mechanism). 
Finally, we calculate the item counts and perform clustering as before. The algorithm, which we 
call the Multi-MaxSense algorithm, is formally presented and analyzed in the Appendix. 

Adaptive queries: The lower bounds of Section 4 applied to non-adaptive learning, where queries 
to users are performed in parallel, without leveraging answers of users 1, . . . , u — 1 when querying 
user $u$. One can in fact extend these bounds to the adaptive setting, where the query to user $u$ is allowed
to depend on the previous queries and answers of users $1, \ldots, u - 1$. Specifically, the following,
shown in Appendix E, holds.

Theorem 12 Assume $w = 1$. If users' answers are $\epsilon$-DP, the number of adaptive queries needed
to learn an unknown content clustering into two types, drawn uniformly at random from $\{0, 1\}^N$, is
$\Omega(N \log N)$.

The proof again relies on bounding the mutual information between the unknown clusters and a
user's sketch, although now the mutual information conditional on the previous queries and their
answers (i.e., of the form $I(Z; S_u | S_1^{u-1} = s_1^{u-1})$) has to be considered. The first step applies an
extension of Lemma 7 to bound this mutual information by the variance of a certain empirical sum
$N^{-1} \sum_{n=1}^{N} f_n(Z_n)$ for bounded functions $f_n$, under the distribution of $Z$ conditional on $S_1^{u-1} =
s_1^{u-1}$. The crux of the proof then consists in showing that, provided this conditional distribution is
close to uniform (i.e., its entropy is $\ge N - \delta$ for some $\delta > 0$), the variance of this empirical
sum under the conditional distribution is no larger than $N^{-1} g(\delta)$ for some constant $g(\delta)$. This
intermediate result is of independent interest, and could enable extensions of the latter theorem, e.g.,
relaxing the assumption that $w = 1$.

We leave it as a topic for further research to establish how sharp this lower bound is. In particular,
if it can be tightened to a lower bound of $\Omega(N^2)$ and further extended to $\Omega(N^2/w)$ for
this would imply that MaxSense is optimal even when one can use adaptive queries. If on the other 
hand there is a gap between non-adaptive and adaptive complexities, then this implies that schemes 
superior to MaxSense in the adaptive case have yet to be identified. 

In conclusion, we have initiated a study in the design of recommender systems under local-DP 
constraints. We have provided lower bounds on the sample complexity in both the information-rich and
information-scarce regimes, quantifying the effect of limited information on private learning. Further,
we showed tightness of these results by designing the MaxSense algorithm, which recovers the item 
clustering under privacy constraints with optimal sample complexity. The lower bound techniques 
naturally extend to cover model selection for more general (finite) hypothesis classes, while 1-bit 
sketches appear appropriate for designing efficient algorithms for the same. Development of such 
algorithms and analysis of matching lower bounds by leveraging and extending the techniques we 
introduced seem promising future research directions. 

Appendix A. Differential Privacy and Mutual Information

In this section we state and prove some basic results instrumental for our lower bounds.

A.1. Differential Privacy and Mutual Information

For the sake of convenience, we restate the definition of Differential Privacy: 

Definition 13 ($\epsilon$-Differential Privacy) A randomized function $\Psi : \mathcal{X} \to \mathcal{Y}$ that maps data $X \in \mathcal{X}$
to $Y \in \mathcal{Y}$ is said to be $\epsilon$-differentially private (or $\epsilon$-DP) if for all values $y \in \mathcal{Y}$ in the range space
of $\Psi$, and for all 'neighboring' data $x, x'$, we have that:

\[ \frac{P[Y = y | X = x]}{P[Y = y | X = x']} \le e^{\epsilon}. \]

In this work, we focus on the local model of differential privacy. The local model is formally 
defined in Kasiviswanathan et al. (2008). Informally, in the context of recommender systems, this 
means that the ratings of each user are assumed to be private data, and hence any information 
given by any user to the system is required to obey the above definition vis-a-vis the user's private 
data. The definition of neighboring databases for checking differential privacy depends on the exact 
information that needs privatization. In particular, we can consider the following two cases in the 
context of recommender systems: 

1. The items rated by a user are not considered private, but the ratings are. Now, given a set of $w$
rated items with ratings (with, say, $\{0, 1\}$ ratings), the neighboring databases are all possible
ratings vectors for these $w$ items (hence, all vectors in $\{0, 1\}^w$).

2. Both ratings as well as rated items are considered private. Now, the neighbors of any set of
rated items and ratings consist of all possible subsets of items, and all possible sets of ratings
for this subset.

We have considered the latter case throughout the paper. However, as mentioned in Section 3, the 
first case (where only ratings are private) can be handled by the Pairwise Preference algorithm. 
Furthermore, the basic lower bound of Section 3, Lemma 4 is also applicable, thereby giving a 
complete characterization of that case. 

Two crucial properties of differential privacy, which we use later in our proofs, are 'composi- 
tion' and 'post-processing' (refer to Dwork et al. (2006) for details). Composition defines how the 
privacy of the data scales upon the application of multiple differentially-private release mechanisms. 
Formally we have: 

Proposition 14 (Composition) If $k$ outputs $\{Y_1, Y_2, \ldots, Y_k\}$ are obtained from data $X \in \mathcal{X}$ by
$k$ different randomized functions $\{\Psi_1, \Psi_2, \ldots, \Psi_k\}$, where $\Psi_i$ is $\epsilon_i$-differentially private, then the
resultant function is $\sum_{i=1}^{k} \epsilon_i$-differentially private.

The post-processing property implies that processing the output of a differentially private release 
mechanism can only make it more differentially private (i.e., with a smaller e) vis-a-vis the private 
input. Formally: 

Proposition 15 (Post-processing) If a function $\Psi_1 : \mathcal{X} \to \mathcal{Y}$ is $\epsilon$-differentially private, then any
composition function $\Psi_2 \circ \Psi_1 : \mathcal{X} \to \mathcal{Z}$ is $\epsilon'$-differentially private for some $\epsilon' \le \epsilon$.

Before we derive a basic bound on the mutual information leaked across a differentially private 
channel, we need to state one important property of mutual information that we use repeatedly in 
our proofs. The Data-Processing inequality (see Cover and Thomas (2006) for details) states that 
mutual information decreases upon further processing. Formally we have: 

Proposition 16 (Data-Processing Inequality) For random variables $X, Y, Z$ forming a Markov
chain $X \to Y \to Z$, we have that:

\[ I(X; Z) \le I(Y; Z). \]

Finally we give a proof for Lemma 4. Similar results have been presented by McGregor et al. 
(2010); Alvim et al. (2011). 

Lemma (Lemma 4 in the paper) Given a random variable $X \in \mathcal{X}$ of user's private data, a privatized
output $Y \in \mathcal{Y}$ obtained by any $\epsilon$-local-DP mechanism $\Psi : \mathcal{X} \to \mathcal{Y}$, and any side information
$Z$, we have:

\[ I(X; Y | Z) \le \epsilon \log e. \]
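
Proof

\begin{align*}
I(X; Y | Z) &= \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} p(x, y | Z) \log \left[ \frac{p(x, y | Z)}{p(x | Z)\, p(y | Z)} \right] \\
&= \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} p(x, y | Z) \log \left[ \frac{p(y | x, Z)}{\sum_{x' \in \mathcal{X}} p(x' | Z)\, p(y | x', Z)} \right] \\
&= \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} p(x, y | Z) \log \left[ \frac{1}{\sum_{x' \in \mathcal{X}} p(x' | Z) \left( p(y | x', Z) / p(y | x, Z) \right)} \right] \\
&\overset{(a)}{\le} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} p(x, y | Z) \log \left[ \frac{1}{\sum_{x' \in \mathcal{X}} p(x' | Z)\, e^{-\epsilon}} \right] \\
&\le \epsilon \log e,
\end{align*}

where inequality (a) is a direct application of the definition of differential privacy (Equation 1), and
in particular, the fact that it holds for any side information. ■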
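
A quick numerical illustration of this lemma, for the special case of a single private bit and no side information, is given below. The mechanism is assumed to be binary randomized response with keep probability $e^{\epsilon}/(1 + e^{\epsilon})$, which is one standard $\epsilon$-DP channel; the helper names are hypothetical. Mutual information is measured in bits, so the bound reads $\epsilon \log_2 e$.

```python
import math


def randomized_response_channel(eps):
    """2x2 channel matrix P(Y = y | X = x) for binary randomized response."""
    k = math.exp(eps) / (1.0 + math.exp(eps))
    return [[k, 1.0 - k], [1.0 - k, k]]


def mutual_information_bits(prior, channel):
    """I(X; Y) in bits for a discrete prior and a row-stochastic channel."""
    n_out = len(channel[0])
    # Marginal distribution of the output Y.
    out = [sum(prior[x] * channel[x][y] for x in range(len(prior)))
           for y in range(n_out)]
    mi = 0.0
    for x, px in enumerate(prior):
        for y in range(n_out):
            joint = px * channel[x][y]
            if joint > 0.0:
                mi += joint * math.log2(channel[x][y] / out[y])
    return mi
```

For small $\epsilon$ the information leaked by this channel is well below $\epsilon \log_2 e$, which illustrates why the sharper bounds of Section 4 are needed in the information-scarce regime.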



A.2. Sample-Complexity Lower Bounds for Private Learning 

The main tool we use for deriving lower bounds is Fano's inequality (Lemma 3); in this section 
we state and derive some stronger forms of the same. The item clustering problem fits in a more 
general framework of model selection from finite hypothesis classes, with local-DP constraints: we 
consider a hypothesis class $\mathcal{H}$, $|\mathcal{H}| = M$, indexed by $[M]$. Given a hypothesis $Z$, samples $X_1^U$ are
drawn in an i.i.d. manner according to some distribution $P_{\mathcal{H}}(Z)$ (in our case, $u \in [U]$ corresponds
to a user, and $X_u$ to the ratings drawn according to the statistical model in Section 2.1; $P_{\mathcal{H}}(Z)$ thus
includes both the sampling of items by a user, as well as the ratings given for the sampled items).
Let $\hat{X}_1^U$ be a privatized version of this data, where for each $u \in [U]$, the output $\hat{X}_u$ is $\epsilon$-differentially
private with respect to the data $X_u$ (by local-DP). Note here that $X_u$ and $\hat{X}_u$ need not belong to the
same space (for example, in the case of the Multi-MaxSense algorithm, $X_u$ is a subset of items and
their ratings, while $\hat{X}_u$ is the collection of privatized responses to the multiple MaxSense queries).
Note also that the probability transition kernel $P_{\mathcal{H}}$ can be known to the algorithm (although the
exact model $Z$ is unknown). Finally the learning algorithm infers the underlying model from the
privatized samples. We can represent this as the Markov chain:

\[ Z \xrightarrow{\text{Sampling}} X_1^U \xrightarrow{\text{Privatization}} \hat{X}_1^U \xrightarrow{\text{Model Selection}} \hat{Z}. \]

In the paper, we considered an algorithm successful only if $\hat{Z} = Z$, i.e., the model is identified
perfectly. A natural relaxation of this is in terms of a distortion metric, as follows: given a distance
function $d : \mathcal{H} \times \mathcal{H} \to \mathbb{R}_+$, we say the learner is successful if, for a given $d > 0$, we have:

\[ d(Z, \hat{Z}) \le d. \]

For any $h \in \mathcal{H}$ we define the set $B_d(h) = \{h' \in \mathcal{H} : d(h, h') \le d\}$. Further, we define $M_d =
\max_{h \in \mathcal{H}} |B_d(h)|$ to be the largest size of such a set. Finally, given a distribution for $Z$, we define
the average error probability $P_e$ for a learning algorithm for the hypothesis class $\mathcal{H}$ as:

\[ P_e = P\left[ d(Z, \hat{Z}) > d \right]. \]

Then we have the following bound on $P_e$:

Lemma 17 (Generalized Fano's Inequality) Given a hypothesis $Z$ drawn uniformly from $\mathcal{H}$, for
any learning algorithm, the average error probability satisfies:

\[ P_e \ge 1 - \frac{I(Z; \hat{X}_1^U) + 1}{\log M - \log M_d}. \]

Proof First, we define an error indicator $E$ as:

\[ E = \begin{cases} 1 & : d(Z, \hat{Z}) > d \\ 0 & : \text{otherwise} \end{cases} \]

and hence $P_e = P[E = 1]$. Define $H(x) = -x \log(x) - (1 - x) \log(1 - x)$. Now we have:

\begin{align*}
I(Z; \hat{X}_1^U) &\ge I(Z; \hat{Z}) && \text{(by the Data Processing Inequality)} \\
&= H(Z) - H(Z | \hat{Z}) \\
&\ge \log M - H(Z | \hat{Z}, E) - H(E | \hat{Z}) && \text{(since $Z$ is uniform over $\mathcal{H} = [M]$, and via basic information inequalities)} \\
&\ge \log M - P_e H(Z | \hat{Z}, E = 1) - (1 - P_e) H(Z | \hat{Z}, E = 0) - 1 && \text{(since $H(P_e) \ge H(E | \hat{Z})$ and $H(P_e) \le 1$)} \\
&\ge (1 - P_e)\left( \log M - H(Z | \hat{Z}, E = 0) \right) - 1 && \text{(since $H(Z | \hat{Z}, E = 1) \le \log M$)} \\
&\ge (1 - P_e)\left( \log M - \log M_d \right) - 1 && \text{(since $H(Z | \hat{Z}, E = 0) \le \log |B_d(\hat{Z})| \le \log M_d$)}
\end{align*}

Rearranging, we have:

\[ P_e \ge 1 - \frac{I(Z; \hat{X}_1^U) + 1}{\log M - \log M_d}. \]
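
To see how this inequality turns a per-user information bound into a sample-complexity statement, the following Python sketch evaluates the Fano bound numerically. It is only an illustration: logarithms are taken base 2, the total information is assumed to be at most $U$ times a per-user bound (as in the non-adaptive setting), and both function names are hypothetical.

```python
import math


def fano_error_lower_bound(total_information_bits, log2_M, log2_M_d=0.0):
    """Lower bound on P_e: 1 - (I + 1) / (log M - log M_d), clipped at 0."""
    denom = log2_M - log2_M_d
    if denom <= 0:
        raise ValueError("log M must exceed log M_d")
    return max(0.0, 1.0 - (total_information_bits + 1.0) / denom)


def users_needed_for_error(target_error, per_user_information_bits,
                           log2_M, log2_M_d=0.0):
    """Smallest U for which the bound no longer forces P_e above target_error.

    Assumes the total information is at most U * per_user_information_bits.
    """
    denom = log2_M - log2_M_d
    needed = ((1.0 - target_error) * denom - 1.0) / per_user_information_bits
    return max(0, math.ceil(needed))
```

For example, with $N = 100$ items in two clusters ($\log_2 M = 100$, exact recovery so $M_d = 1$) and a per-user information bound of $1/N$ bits, `fano_error_lower_bound(50.0, 100.0)` shows that $U = 5000$ users still leave an error probability of at least 0.49, which is the $N^2$ scaling in miniature.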



We now have two immediate corollaries of this lemma. First, we consider the non-adaptive 
learning case, i.e., where the data of each user X u is obtained in an i.i.d manner. Then we have: 

Corollary 18 Given a hypothesis $Z$ drawn uniformly from $\mathcal{H}$, for any non-adaptive learning algorithm,
the average error probability with $U$ users satisfies:

\[ P_e \ge 1 - \left( \frac{U\, I(Z; \hat{X}_u) + 1}{\log M - \log M_d} \right). \]

Before moving on, we note that these results do not imply that we are assuming a prior on the
hypothesis class for our algorithms; rather, the lower bound can be viewed as a probabilistic argument
that shows that below a certain sample complexity, any learner fails to learn a large fraction
of models. In this light, a stronger restatement of the above result is: if $U = o\left( \frac{\log M - \log M_d}{I(Z; \hat{X}_u)} \right)$,
then $P_e \to 1$ and hence any learning algorithm fails to learn the underlying hypothesis even up to a
distortion of $d$ for almost all models in the hypothesis class.

Next, using Lemma 4, we get a bound on the sample complexity of learning under local-DP. 

Corollary 19 Given a hypothesis $Z$ drawn uniformly from $\mathcal{H}$, for any learning algorithm on $U$
privatized samples, each obtained via $\epsilon$-local-DP, the average error probability satisfies:

\[ P_e \ge 1 - \frac{1}{\ln 2} \left( \frac{U \epsilon + 1}{\log M - \log M_d} \right). \]



Returning to our problem of learning item clusters, we note that $M = K^N$ in that case. Further,
by choosing $d$ as the edit distance (Hamming distance) between two clusterings of items (i.e., for
two clusterings $C_N$ and $C'_N$, $d(C_N, C'_N)$ is the number of items that are mapped to different
clusters in the two clusterings), we get that:



[Binomial (N, 1/K) > N — d] 



i=0 

K N 



K\ 

K N (-NK(l-±-±f 



-in exp v .> j 

Now, using the above results, we can derive a more general version of Theorem 5. 

Theorem 20 Suppose the underlying clustering $C_N(\cdot) : [N] \to [K]$ is drawn uniformly at random
from $\{0, 1\}^N$. Further, for a given tolerance $d > 0$ and error threshold $p_{\max}$, we define a learning
algorithm to be unreliable for the hypothesis class $\mathcal{H}$ if:



max J 

he[M] 



d(Z, Z)> d > p max . 

Then any learning algorithm that obeys e-local-DP is unreliable if the number of queries U satisfies: 

U < (1 -p ma 



V 3e I 

Appendix B. The Pairwise Preference Algorithm 

In this appendix, we define and analyze the Pairwise Preference Algorithm for identifying the item 
classes in the information-rich regime. The algorithm is formally specified in Algorithm 1.

Theorem 21 (Theorem 8 in the paper) The Pairwise-Preference algorithm is $\epsilon$-differentially private.
Further, in the information-rich regime, suppose we have the following non-degeneracy conditions
on the eigenvalues and eigenvectors of $A$:

• The $L$ largest magnitude eigenvalues of $A$ have distinct absolute values.

• The corresponding eigenvectors $y_1, y_2, \ldots, y_L$ (normalized under the $\alpha$-norm, which is defined
as $\|y\|_{\alpha}^2 = \sum_k \alpha_k y_k^2$) satisfy:

\[ t_k \ne t_l, \quad 1 \le k < l \le L, \]

where $t_k = (y_1(k), \ldots, y_L(k))$.

Then there exists $c > 0$ such that the item clustering is successful with high probability if the number
of users satisfies:

\[ U \ge c (N \log N). \]

Proof As mentioned before, privacy for the algorithm is guaranteed by the use of e-DP bit release 
(Proposition 2), and the composition property of DP (Proposition 14). 

Algorithm 1 The Pairwise-Preference Algorithm

Setting: $N$ items $[N]$, $U$ users $[U]$. Each user has a set of $w$ ratings $(W_u, R_u)$, $W_u \in \binom{[N]}{w}$, $R_u \in
\{0, 1\}^w$. Each item $i$ is associated with a cluster $C_N(i)$ from a set of $L$ clusters, $\{1, 2, \ldots, L\}$.
Output: The cluster labels of each item, i.e., $\{C_N(i)\}_{i \in [N]}$.
Stage 1 (User sketch generation):

• For each user $u \in [U]$, the algorithm picks a pair of items $P_u = \{i_u, j_u\}$:

- At random if $w = \Omega(N)$.

- If $W_u$ is known, it picks a random set of two rated items.

• User $u$ generates a private sketch $S_u^0$ given by:

\[ S_u^0 = \begin{cases} 1 & : R_u(i_u) = R_u(j_u) = 1 \\ 0 & : \text{otherwise} \end{cases} \]

where $R_u(i) = 1$ if $i \in W_u$ and item $i$ is rated positively, and $0$ otherwise.

Stage 2 (User sketch privatization): Each user $u \in [U]$ releases a privatized sketch $S_u$ from $S_u^0$
using the $\epsilon$-DP bit release mechanism (Proposition 2).
Stage 3 (Spectral Clustering):

• Generate a pairwise-preference matrix $A$, where:

\[ A_{ij} = \sum_{u \in U \,:\, P_u = \{i, j\}} S_u \]

• Extract the top $L$ normalised eigenvectors $x_1, x_2, \ldots, x_L$ corresponding to the $L$ largest magnitude
eigenvalues of matrix $A$. Embed each row (node) into $L$-dimensional Euclidean space
by assigning as coordinates the corresponding entries of the $L$ eigenvectors.

• Perform $k$-means clustering in the profile space to get the item clusters.

We will prove the sample complexity bound for the case where $w = \Omega(N)$, as the other case
(where the rated items are not private) follows similarly. From the definition of the $\epsilon$-DP bit release
mechanism, we have that:

\[ P[S_u = 1] = \frac{1 + (e^{\epsilon} - 1) P[S_u^0 = 1]}{e^{\epsilon} + 1}, \]

and thus for any pair of items {i, j}, defining bij = J2k=i a k{bkibkj + (1 — ^fei)(l — Hj)) (i- e -> the 
probability that a random user has identical preference for items i and j) and = 1 — b^, we have: 

F\S -IP -ft 7-11- 1 ( 1 I (^"^ ^-^ j ^ * b Ji 
[ ^ u ~ ' u ~ il,Jil ~ N(N-l) {e* + l + W + V iV(iV - 1) *V ~ N(N - 1) ' 

and similarly: 

P[5 U = 0,P M = {i,j}] = iy(Ar _ 1) + (f + t) N(N-l) ibij ~ 1} ) " N(N-iy 

where, under the assumptions that w = fi(iV) and e = 0(1), we have that 6^ , fey are both 0(1). 
Now, since Aij = l{s u =i,p u ={ij}}, we have that: 

Aij ~ Binomial [U, 



n(n - 1; 



And setting U = cN log N, we have that: 

F[A i:j > 0] = 1 - ( 1 



y 



iV(iV - 1) 



N(N-l) \N 4 
-c'b l ° gN 1 Q ^ l0giV ) 2 

Thus we can interpret A as representing the edges of a random graph over the item set, with an 
edge between an item in class i and another in class j if > 0; the probability of such an edge is 

can now use clustering results from Tomozei and Massoulie (2011) to complete 



N 

the proof. ■ 

We note that the above analysis does not give us exact scaling behavior with respect to e; this would 
require more detailed analysis. However, the MaxSense algorithm analyzed in Appendix D al- 
lows us to determine the trade-off between privacy and performance more accurately. Furthermore, 
the MaxSense algorithm, under stronger separation assumptions than the Pairwise-Preference algo- 
rithm, achieves the same scaling with respect to N in the information-rich regime and also optimal 
scaling for many other regimes for w. 

Appendix C. Lower bounds for the Information-scarce Setting

In this appendix, we prove the results stated in Section 4. Recall that we consider a scenario where
there is a single class of users, and each item is ranked either 0 or 1 deterministically by each user.
$C_N(\cdot) : [N] \to \{0, 1\}$ is the underlying clustering function. We assume that the user-data for user
$u$ is given by $X_u = (I_u, Z_u)$, where $I_u$ is a size-$w$ subset of $[N]$ representing items rated by user $u$,
and $Z_u$ are the ratings for the corresponding items; in this case, $Z_u = \{Z(i)\}_{i \in I_u}$. We also denote
the privatized sketch from user $u$ as $S_u \in \mathcal{S}$. Here the space $\mathcal{S}$ from which sketches are drawn is
assumed to be an arbitrary finite or countably infinite space. The sketch is assumed to obey $\epsilon$-DP.
Finally, we assume that $Z$ is chosen uniformly over $\{0, 1\}^N$, and the set of items $I_u$ rated by user $u$
is also assumed to be chosen uniformly at random from amongst all size-$w$ subsets of $[N]$.

We first develop two general lemmas in Section 4.1 which we use in our proof, but which can 
potentially be used for other similar situations. Then, in Section 4.2, we use these to derive tighter 
bounds on the scaling required for accurate cluster learning. 

C.1. Local Differential Privacy and Mutual Information

We first establish two lemmas that we need in order to obtain the lower bound for learning in the 
information-scarce regime. The first lemma is a simple consequence of differential privacy and 
establishes a relation between the distribution of a random variable with and without conditioning 
on a differentially private sketch: 

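Lemma 22 Given a discrete random variable $A \in \mathcal{A}$ and some $\epsilon$-differentially private 'sketch'
variable $S \in \mathcal{S}$ generated from $A$, there exists a function $\lambda : \mathcal{A} \times \mathcal{S} \to [e^{-\epsilon}, e^{\epsilon}]$ such that for any
$a \in \mathcal{A}$ and $s \in \mathcal{S}$:

\[ P(A = a | S = s) = P(A = a)\, \lambda(a, s). \]

Proof

\begin{align*}
P(A = a | S = s) &= \frac{P(A = a)\, P(S = s | A = a)}{\sum_{a' \in \mathcal{A}} P(A = a')\, P(S = s | A = a')} && \text{(from Bayes' Theorem)} \\
&= P(A = a) \left( \sum_{a' \in \mathcal{A}} P(A = a') \frac{P(S = s | A = a')}{P(S = s | A = a)} \right)^{-1} \\
&= P(A = a)\, \lambda(a, s).
\end{align*}

Further, from the definition of $\epsilon$-differential privacy, we have that:

\[ e^{-\epsilon} \le \frac{P(S = s | A = a')}{P(S = s | A = a)} \le e^{\epsilon}, \]

and hence we have $\lambda(a, s) \in [e^{-\epsilon}, e^{\epsilon}]$, $\forall\, a \in \mathcal{A}, s \in \mathcal{S}$. ■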
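
The statement of Lemma 22 is easy to check numerically on a small example. The Python sketch below is illustrative only: it computes the ratio of posterior to prior, which by Bayes' rule equals $P(S = s | A = a) / P(S = s)$, for an arbitrary discrete channel, together with a brute-force test of the $\epsilon$-DP condition. The function names and the example channel used to exercise them are assumptions made for this illustration.

```python
import math


def posterior_to_prior_ratios(prior, channel):
    """lambda[a][s] = P(A = a | S = s) / P(A = a) = P(S = s | A = a) / P(S = s)."""
    n_out = len(channel[0])
    marginal = [sum(p * row[s] for p, row in zip(prior, channel))
                for s in range(n_out)]
    return [[row[s] / marginal[s] for s in range(n_out)] for row in channel]


def is_eps_dp(channel, eps):
    """Check P(S = s | A = a) <= e^eps * P(S = s | A = a') for all a, a', s."""
    bound = math.exp(eps)
    return all(row_a[s] <= bound * row_b[s] + 1e-12
               for row_a in channel
               for row_b in channel
               for s in range(len(row_a)))
```

Whenever `is_eps_dp` holds for a channel, every entry returned by `posterior_to_prior_ratios` should lie in $[e^{-\epsilon}, e^{\epsilon}]$, regardless of the prior.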

Recall that we define $\binom{[N]}{w}$ to be the collection of all size-$w$ subsets of $[N] = \{1, 2, \ldots, N\}$, and
$\mathcal{D} = \binom{[N]}{w} \times \{0, 1\}^w$ to be the set from which user information (i.e., $(I, Z)$) is drawn (and define
$D = |\mathcal{D}| = \binom{N}{w} 2^w$). Finally, $\mathbb{E}_X[\cdot]$ indicates that the expectation is over the random variable $X$.

We now establish the following general lemma, which we will use to bound the mutual information
between the model and the sketch.

Lemma 23 Assume that under probability distribution $P$, the set $I$ of items whose type is available
to a given user is independent of the type vector $Z$. Denote $p_s(i, z) := P((I, Z_I) = (i, z) | S = s)$.
Also, for subsets $j \subseteq [N]$, denote $P_j(z) := P(Z_j = z)$. Then the following holds:

\[ I(Z; S = s) \le \sum_{i, z} \sum_{i', z'} p_s(i, z)\, p_s(i', z') \left[ \mathbb{1}_{\{z \equiv z'\}} \frac{P_{i \cup i'}(z \cup z')}{P_i(z)\, P_{i'}(z')} - 1 \right]. \tag{6} \]

Proof From the definition of mutual information, we have:

\[ I(Z; S) = \sum_{z, s} P[(Z, S) = (z, s)] \log \frac{P[(Z, S) = (z, s)]}{P[Z = z]\, P[S = s]} = \mathbb{E}_S\left[ I(Z; S = s) \right], \]

where we use the notation:

\[ I(Z; S = s) := \sum_{z} P[Z = z | S = s] \log \left( \frac{P[Z = z | S = s]}{P[Z = z]} \right). \]

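Now note that

\begin{align*}
P[Z = z | S = s] &= \sum_{i_1, z_1} P[Z = z, (I_1, Z_1) = (i_1, z_1) | S = s] \\
&= \sum_{i_1, z_1} P[Z = z | i_1, z_1]\, P[(I_1, Z_1) = (i_1, z_1) | s] \\
&= \sum_{i_1, z_1} p_s(i_1, z_1)\, \mathbb{1}_{\{z_{i_1} = z_1\}} \frac{P[Z = z]}{P_{i_1}(z_1)}.
\end{align*}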

Combining the equations, we get 

Using Jensen's inequality, the R.H.S. is upper bounded by the corresponding expression where
averaging over $z$ conditionally on $Z_{i_1} = z_1$ is taken inside the logarithm, yielding



I(Z;S = s) < ^p s (ii,zi)log ^1 Z = Z1 — - ^2p s (i2, 



Z2) 



Ph(zi) ^ ' Pi 2 {z2) 

E/- m ( c \n Phui 2 (zi U z 2 ) 
P.(*l, *) E Ps^ Z2)1 Z ^ Z2 

n,zi \«2,22 

The result now follows by using the inequality log(x) < x — 1. ■ 

Note that in the above lemma we do not make any assumption regarding: i) the distribution of $Z$, ii)
the distribution of the user-data $(I, Z)$. Now assuming that $Z$ is uniformly distributed on $\{0, 1\}^N$,
we get the following corollary, which was stated before as Lemma 7:

Corollary 24 Given the Markov chain $Z \to (I, Z) \to S$, where $Z$ is drawn uniformly from
$\{0, 1\}^N$, let $(I_1, Z_1), (I_2, Z_2) \in \mathcal{D}$ be two pairs of 'user-data' sets which are drawn i.i.d. according
to the conditional distribution of $(I, Z)$ given $S = s$. Then, the mutual information $I(Z; S)$
satisfies:

\[ I(Z; S) \le \mathbb{E}_S\left[ \mathbb{E}_{(I_1, Z_1)|S \,\perp\!\!\!\perp\, (I_2, Z_2)|S}\left[ 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right] \right]. \]

Unless we specifically mention otherwise, when we refer to Lemma 23, we will mean this corollary.

C.2. Learning with 1-bit sketches

We next provide a proof of Theorem 8 from the paper. We restate the theorem for convenience. 

Theorem (Theorem 8 in the paper) Suppose $w = 1$, with $(I, Z)$ drawn i.i.d. uniformly over
$[N] \times \{0, 1\}$. Then for any 1-bit sketch derived from $(I, Z)$, it holds that:

\[ I(Z; S) = O\left( \frac{1}{N} \right), \]

and consequently, there exists a constant $c > 0$ such that any cluster learning algorithm using
queries with 1-bit responses is unreliable if the number of users satisfies:

\[ U < cN^2. \]

Proof In order to use Lemma 23, we first note that $I(Z; S)$ is a convex function of $P[S = s | Z = z]$
for fixed $P[Z = z]$ (Theorem 2.7.4, Cover and Thomas (2006)). Writing $P[S = s | Z = z]$ as
$\sum_{(i, z) \in \mathcal{D}} P[S = s | (I, Z) = (i, z)]\, P[(I, Z) = (i, z) | Z = z]$, we observe that the extremal points of the
kernel $P[S = s | Z = z]$ correspond to $P[S = s | (i, z)] \in \{0, 1\}$, where the mutual information is
maximized. This implies that the class of deterministic queries with 1-bit response that maximizes
mutual information has the following structure: given user-data $(I_u, Z_u)$, user $u$'s response $S_u \in
\{0, 1\}$ is of the form $S_u = \mathbb{1}_A(I_u, Z_u)$, where $A \subseteq \{(i, z) : i \in [N], z \in \{0, 1\}\}$. In other words,
the algorithm provides user $u$ with an arbitrary set $A$ of (item, rating) pairs, and the user identifies if
$(I_u, Z_u)$ is contained in $A$.

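Defining $p^s_{i,z} = P[(I, Z) = (i, z) | S = s]$, for a query response $S = \mathbb{1}_A(I_u, Z_u)$, we have the
following:

\begin{align*}
p^1_{i,z} &= \frac{P[(I, Z) = (i, z)]\, P[S = 1 | (i, z)]}{\sum_{(j, z'_j)} P[(I, Z) = (j, z'_j)]\, P[S = 1 | (j, z'_j)]} \\
&= \frac{\mathbb{1}_A(i, z)}{\sum_{i=1}^{N} \left\{ \mathbb{1}_A(i, 0) + \mathbb{1}_A(i, 1) \right\}} \\
&= \frac{\mathbb{1}_A(i, z)}{|A|},
\end{align*}

and similarly $p^0_{i,z} = \frac{\mathbb{1}_{\bar{A}}(i, z)}{|\bar{A}|}$, where $\bar{A}$ is the complement of set $A$. From Lemma 7, for
$(I_1, Z_1)|S \perp\!\!\!\perp (I_2, Z_2)|S$, we have:

\begin{align*}
I(Z; S) &\le \mathbb{E}_S\left[ \mathbb{E}\left[ 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right] \right] \\
&= \sum_{s \in \{0, 1\}} P[S = s]\, \mathbb{E}\left[ \mathbb{1}_{\{I_1 = I_2\}} \left( 2 \cdot \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right) \Big| S = s \right].
\end{align*}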
se{o,i} 
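
Introducing the notation $P(I = \ell, Z(\ell) = \sigma | S = s) = \pi^s_{\ell\sigma}$, the following identity is easily
established:

\[ \sum_{\ell=1}^{N} \mathbb{E}\left[ \mathbb{1}_{\{I_1 = I_2 = \ell\}} \left( 2 \cdot \mathbb{1}_{\{Z_1(\ell) = Z_2(\ell)\}} - 1 \right) \Big| S = s \right] = \sum_{\ell=1}^{N} \left( \pi^s_{\ell 0} - \pi^s_{\ell 1} \right)^2. \tag{7} \]

The left-hand side of (7) is thus a non-negative definite quadratic form of the variables $p^s_{i,z}$ (since
$\pi^s_{\ell\sigma} = \sum_{(i, z) :\, i = \ell,\, z(\ell) = \sigma} p^s_{i,z}$). Thus we have:

\begin{align*}
I(Z; S) &\le \sum_{s \in \{0, 1\}} P[S = s] \sum_{\ell=1}^{N} \left( \pi^s_{\ell 0} - \pi^s_{\ell 1} \right)^2 \\
&= \sum_{s \in \{0, 1\}} P[S = s] \frac{1}{|A_s|^2} \sum_{i=1}^{N} \mathbb{1}_{\{|A_s \cap \{(i, 0), (i, 1)\}| = 1\}},
\end{align*}

where $A_s = A$ if $s = 1$ and $\bar{A}$ if $s = 0$. Now for a given $A$, consider the partitioning of the set $[N]$
into $C_0 \cup C_1 \cup C_2$, where for $k = 0, 1, 2$, $\forall i \in C_k$, $|A \cap \{(i, 0), (i, 1)\}| = k$. We then have the
following:

\begin{align*}
I(Z; S) &\le P[S = 1] \frac{|C_1|}{|A|^2} + P[S = 0] \frac{|C_1|}{|\bar{A}|^2} \\
&= \frac{|C_1|}{2N} \left( \frac{1}{|A|} + \frac{1}{2N - |A|} \right) \\
&\le \frac{1}{N}.
\end{align*}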

Now using Fano's inequality (Lemma 17), we get the theorem. 
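
For small $N$, the final chain of inequalities can be checked by brute force over every deterministic 1-bit query. The Python sketch below is an illustration under the assumptions of the theorem ($w = 1$, user data uniform over $[N] \times \{0, 1\}$, items indexed from 0); it evaluates $\sum_s P[S = s]\, |C_1| / |A_s|^2$ exactly with rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def one_bit_bound(A, N):
    """Evaluate sum_s P[S = s] * |C_1| / |A_s|^2 for the query set A.

    A is a collection of (item, rating) pairs; S = 1 iff the user's pair is in A.
    """
    universe = {(i, z) for i in range(N) for z in (0, 1)}
    total = Fraction(0)
    for A_s in (set(A), universe - set(A)):
        if not A_s:
            continue  # this answer never occurs, so it contributes nothing
        prob_s = Fraction(len(A_s), 2 * N)
        # Items with exactly one of their two (item, rating) pairs in A_s.
        c1 = sum(1 for i in range(N) if ((i, 0) in A_s) != ((i, 1) in A_s))
        total += prob_s * Fraction(c1, len(A_s) ** 2)
    return total


def worst_case_bound(N):
    """Maximum of the bound over all subsets A of [N] x {0, 1}."""
    universe = [(i, z) for i in range(N) for z in (0, 1)]
    return max(one_bit_bound(A, N)
               for r in range(len(universe) + 1)
               for A in combinations(universe, r))
```

The maximum is attained, for instance, by the query "is your rating equal to 1?", i.e., $A = \{(i, 1) : i \in [N]\}$, for which the expression equals exactly $1/N$.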



C.3. Lower Bound on Scaling for Clustering with Local-DP 

Recall that we defined $\mathcal{D} = \binom{[N]}{w} \times \{0, 1\}^w$ to be the set from which user information (i.e., $(I, Z)$)
is drawn. We write $P^0$ for the base probability distribution on $(I_1, Z_1)$ and $(I_2, Z_2)$, i.e., the two are
independent and uniformly distributed over $\mathcal{D}$, and denote by $\mathbb{E}^0$ mathematical expectation under
$P^0$. For completeness, we state and prove the following basic asymptotic estimate which we use
several times in the subsequent proofs.



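Lemma 25 If $w = o(N)$, then:

\[
\frac{\binom{N-w}{w}}{\binom{N}{w}} = 1 - \frac{w^2}{N} + O\!\left( \frac{w^4}{N^2} \right).
\]

Proof We have

\[
\frac{\binom{N-w}{w}}{\binom{N}{w}} = \prod_{k=0}^{w-1} \frac{N - w - k}{N - k} = \prod_{k=0}^{w-1} \left( 1 - \frac{w}{N - k} \right),
\]

and hence

\[
\left( 1 - \frac{w}{N - w + 1} \right)^{w} \le \frac{\binom{N-w}{w}}{\binom{N}{w}} \le \left( 1 - \frac{w}{N} \right)^{w}.
\]

Now for the upper bound, using the binomial expansion, we have:

\[
\left( 1 - \frac{w}{N} \right)^{w} \le 1 - \frac{w^2}{N} + \frac{w^4}{2N^2} = 1 - \frac{w^2}{N} + O\!\left( \frac{w^4}{N^2} \right).
\]

Similarly for the lower bound, we have:

\[
\left( 1 - \frac{w}{N - w + 1} \right)^{w} \ge 1 - \frac{w^2}{N - w + 1}
= 1 - \frac{w^2}{N} - \frac{w^2 (w - 1)}{N (N - w + 1)}
= 1 - \frac{w^2}{N} - O\!\left( \frac{w^4}{N^2} \right).
\]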
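The estimate of Lemma 25 can be sanity-checked with exact binomial coefficients. This is a minimal sketch: the helper names `disjoint_prob` and `sandwich` and the chosen values of $(N, w)$ are illustrative assumptions. It computes the ratio $\binom{N-w}{w} / \binom{N}{w}$, i.e., the probability that two independent uniform size-$w$ subsets of $[N]$ are disjoint, together with the sandwich $(1 - w/(N-w+1))^w \le \cdot \le (1 - w/N)^w$ that follows from the product form of the ratio.

```python
from math import comb


def disjoint_prob(N, w):
    """Exact value of C(N-w, w) / C(N, w)."""
    return comb(N - w, w) / comb(N, w)


def sandwich(N, w):
    """Return (lower bound, exact ratio, upper bound) from the product representation."""
    lower = (1 - w / (N - w + 1)) ** w
    upper = (1 - w / N) ** w
    return lower, disjoint_prob(N, w), upper


for N, w in [(10**3, 3), (10**4, 10), (10**6, 50)]:
    lo, ratio, hi = sandwich(N, w)
    # Compare the exact ratio with the first-order approximation 1 - w^2/N.
    print(N, w, lo, ratio, hi, 1 - w * w / N)
```

In each case the gap between the exact ratio and $1 - w^2/N$ is of order $w^4/N^2$ or smaller.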



Now we can prove Theorems 9 and 10:

Theorem (Theorem 9 in the paper) In the information-scarce regime, i.e., when $w = o(N)$, we
have that:

\[
\mathcal{I}(Z; S) = O\!\left( \frac{w^2}{N} \right),
\]

and consequently, there exists a constant $c > 0$ such that any cluster learning algorithm with
$\epsilon$-local-DP is unreliable if the number of users satisfies:

\[
U < c\, \frac{N^2}{w^2}.
\]

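Proof To bound the mutual information between the underlying model and each private sketch, we
use Lemma 7. In particular, we show that the mutual information is bounded by $O(w^2/N)$ for any given
value $s$ of the private sketch. Below we denote by $E_s[\cdot]$ expectations conditionally on $S = s$.
Consider any sketch realization $S = s$. Now, we have:

\[
E_s\!\left[ 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right]
\le E_s\!\left[ \mathbb{1}_{\{Z_1 \equiv Z_2\}} \left( 2^{|I_1 \cap I_2|} - 1 \right) \right].
\]

The RHS of the above equation is a non-negative quadratic function of the variables $\{p_{i,z}\}$ (where
$p_{i,z} = P[(I, Z) = (i, z) \mid S = s]$). Now, using Lemma 22, we get:

\[
\begin{aligned}
E_s\!\left[ 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right]
&\le e^{2\epsilon}\, E^0\!\left[ \mathbb{1}_{\{Z_1 \equiv Z_2\}} \left( 2^{|I_1 \cap I_2|} - 1 \right) \right] \\
&= e^{2\epsilon} \sum_{k=0}^{w} E^0\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| = k\}}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} \left( 2^{k} - 1 \right) \right] \\
&= e^{2\epsilon} \sum_{k=0}^{w} P^0\!\left[ |I_1 \cap I_2| = k \right] 2^{-k} \left( 2^{k} - 1 \right) \\
&= e^{2\epsilon} (A_1 + A_2),
\end{aligned}
\]

where

\[
A_1 = \frac{1}{2}\, E^0\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| = 1\}} \right], \qquad
A_2 = E^0\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| > 1\}} \left( 1 - 2^{-|I_1 \cap I_2|} \right) \right].
\]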

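We bound each of these terms separately. For $A_1$, we have:

\[
A_1 = \frac{1}{2}\, E^0\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| = 1\}} \right]
= \frac{1}{2} \cdot \frac{w \binom{N-w}{w-1}}{\binom{N}{w}}
= \frac{w^2}{2(N - 2w + 1)} \cdot \frac{\binom{N-w}{w}}{\binom{N}{w}}
= O\!\left( \frac{w^2}{N} \right). \tag{8}
\]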



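Similarly for $A_2$, we have:

\[
\begin{aligned}
A_2 &\le E^0\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| > 1\}} \right]
= 1 - P^0\!\left[ |I_1 \cap I_2| < 2 \right] \\
&= 1 - \left( 1 + \frac{w^2}{N - 2w + 1} \right) \frac{\binom{N-w}{w}}{\binom{N}{w}} \\
&= 1 - \left( 1 + \frac{w^2}{N - 2w + 1} \right) \left( 1 - \frac{w^2}{N} + O\!\left( \frac{w^4}{N^2} \right) \right)
= O\!\left( \frac{w^4}{N^2} \right). \tag{9}
\end{aligned}
\]

Combining equations (8) and (9), we get the result.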
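The two base-measure terms can also be evaluated exactly, since under $P^0$ the overlap $|I_1 \cap I_2|$ of two independent uniform size-$w$ subsets of $[N]$ is hypergeometric. The following sketch uses hypothetical helper names `overlap_pmf` and `base_measure_terms`, and takes $A_1 = \frac{1}{2} P^0[|I_1 \cap I_2| = 1]$ and $A_2 = E^0[\mathbb{1}_{\{|I_1 \cap I_2| > 1\}}(1 - 2^{-|I_1 \cap I_2|})]$ as the definitions being illustrated.

```python
from math import comb


def overlap_pmf(N, w, k):
    """P0[|I1 ∩ I2| = k] for two independent uniform size-w subsets of [N]."""
    return comb(w, k) * comb(N - w, w - k) / comb(N, w)


def base_measure_terms(N, w):
    """Exact values of A1 and A2 under the base measure P0."""
    a1 = 0.5 * overlap_pmf(N, w, 1)
    a2 = sum(overlap_pmf(N, w, k) * (1 - 2.0 ** (-k)) for k in range(2, w + 1))
    return a1, a2


N, w = 10**4, 10
a1, a2 = base_measure_terms(N, w)
# A1 is of order w^2/N, while A2 is of order w^4/N^2 and hence much smaller.
print(a1, w * w / (2 * N), a2, w ** 4 / (2 * N ** 2))
```

For these values $A_1$ is close to $w^2/(2N)$ and dominates $A_2$, in line with the remark that the bound on $A_1$ is the dominant term.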
Note that the dominant term in the above proof is the bound on $A_1$, which is closely connected
to the case considered in Theorem 8, where we assumed $w = 1$. Now we have:

Theorem (Theorem 10 in the paper) Under the scaling assumption $w = o(N^{1/3})$, and for
$\epsilon < \ln(2)$, it holds that

\[
\mathcal{I}(Z; S) = O\!\left( \frac{w}{N} \right),
\]

and thus there exists a constant $c > 0$ such that any cluster learning algorithm with $\epsilon$-local-DP is
unreliable if the number of users satisfies:

\[
U < c\, \frac{N^2}{w}.
\]

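Proof In the proof of Theorem 9, the two steps which are weak are the conversion to the base
measure $P^0[\cdot]$ using Lemma 22, and the evaluation of the bound for $A_1$. We start off by performing
a similar decomposition of the bound, but without first converting to the base measure. For any
$S = s$, we have:

\[
\begin{aligned}
E\!\left[ 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right]
&= \sum_{\ell=1}^{N} E\!\left[ \mathbb{1}_{\{I_1 \cap I_2 = \{\ell\}\}} \left( 2\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right) \right]
+ E\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| > 1\}} \left( 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right) \right] \\
&= A'_1 + A''_1 + A'_2,
\end{aligned}
\]

where

\[
\begin{aligned}
A'_1 &= \sum_{\ell=1}^{N} E\!\left[ \mathbb{1}_{\{\ell \in I_1 \cap I_2\}} \left( 2\, \mathbb{1}_{\{Z_1(\ell) = Z_2(\ell)\}} - 1 \right) \right], \\
A''_1 &= - \sum_{\ell=1}^{N} E\!\left[ \mathbb{1}_{\{\ell \in I_1 \cap I_2;\ |I_1 \cap I_2| > 1\}} \left( 2\, \mathbb{1}_{\{Z_1(\ell) = Z_2(\ell)\}} - 1 \right) \right], \\
A'_2 &= E\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| > 1\}} \left( 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} - 1 \right) \right].
\end{aligned}
\]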

Note that $A'_1 + A''_1$ is similar to $A_1$, and $A'_2$ similar to $A_2$, in Theorem 9 (albeit without first
converting to the base measure). Unlike before, however, we first bound $A''_1 + A'_2$, establishing that
$A''_1 + A'_2 = O(w^4/N^2) = o(w/N)$ whenever $w = o(N^{1/3})$. For $A'_1$, we need to employ a more
sophisticated technique for bounding. As before, we write $P^0$ for the base probability distribution
under which $(I_1, Z_1)$ and $(I_2, Z_2)$ are independent and uniformly distributed over $V$, and denote by
$E^0$ mathematical expectation under $P^0$. For $A''_1$, we have:

\[
A''_1 \le \sum_{\ell=1}^{N} E\!\left[ \mathbb{1}_{\{\ell \in I_1 \cap I_2;\ |I_1 \cap I_2| > 1\}} \right]
= E\!\left[ |I_1 \cap I_2|\, \mathbb{1}_{\{|I_1 \cap I_2| > 1\}} \right].
\]
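Since the RHS is non-negative, we use Lemma 22 to convert the expectation to the base measure.
Thus, we get:

\[
\begin{aligned}
A''_1 &\le e^{2\epsilon} \left[ E^0\!\left[ |I_1 \cap I_2| \right] - P^0\!\left[ |I_1 \cap I_2| = 1 \right] \right] \\
&= e^{2\epsilon} \left( \frac{w^2}{N} - \frac{w \binom{N-w}{w-1}}{\binom{N}{w}} \right)
= e^{2\epsilon} \left( \frac{w^2}{N} - \frac{w^2}{N - 2w + 1} \cdot \frac{\binom{N-w}{w}}{\binom{N}{w}} \right). \tag{10}
\end{aligned}
\]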



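Similarly for $A'_2$, we have:

\[
\begin{aligned}
A'_2 &\le E\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| > 1\}}\, 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} \right] \\
&\le e^{2\epsilon}\, E^0\!\left[ \mathbb{1}_{\{|I_1 \cap I_2| > 1\}}\, 2^{|I_1 \cap I_2|}\, \mathbb{1}_{\{Z_1 \equiv Z_2\}} \right] \\
&\le e^{2\epsilon}\, P^0\!\left[ |I_1 \cap I_2| > 1 \right],
\end{aligned}
\]

as $P^0[Z_1 \equiv Z_2 \mid I_1, I_2] = 2^{-|I_1 \cap I_2|}$. Now since $I_1$ and $I_2$ are picked independently and uniformly over all
size-$w$ subsets of $[N]$ (under $P^0$), we have:

\[
A'_2 \le e^{2\epsilon} \left( 1 - \frac{\binom{N-w}{w}}{\binom{N}{w}} - \frac{w \binom{N-w}{w-1}}{\binom{N}{w}} \right)
= e^{2\epsilon} \left( 1 - \left( 1 + \frac{w^2}{N - 2w + 1} \right) \frac{\binom{N-w}{w}}{\binom{N}{w}} \right). \tag{11}
\]

Finally combining equations (10) and (11), we get:

\[
A''_1 + A'_2 \le e^{2\epsilon} \left( 1 + \frac{w^2}{N} - \left( 1 + \frac{2w^2}{N - 2w + 1} \right) \frac{\binom{N-w}{w}}{\binom{N}{w}} \right),
\]

and using Lemma 25, we get:

\[
A''_1 + A'_2 \le e^{2\epsilon} \left( 1 + \frac{w^2}{N} - \left( 1 + \frac{2w^2}{N} + \frac{2w^2 (2w - 1)}{N (N - 2w + 1)} \right) \left( 1 - \frac{w^2}{N} + O\!\left( \frac{w^4}{N^2} \right) \right) \right)
= O\!\left( \frac{w^4}{N^2} \right).
\]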

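Thus, we now have:

\[
E\!\left[ 2^{|I \cap J|}\, \mathbb{1}_{\{Z(\ell) = Z'(\ell)\ \forall \ell \in I \cap J\}} - 1 \right]
\le \sum_{\ell=1}^{N} E\!\left[ \mathbb{1}_{\{\ell \in I \cap J\}} \left( 2\, \mathbb{1}_{\{Z(\ell) = Z'(\ell)\}} - 1 \right) \right] + O\!\left( \frac{w^4}{N^2} \right).
\]

Under the scaling assumption $w = o(N^{1/3})$, the second term in the right-hand side of the
above equation is $o(w/N)$, and we only need to establish that the first term in the right-hand side is
$O(w/N)$.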
As in Theorem 8, we introduce the notation $P(\ell \in I, Z(\ell) = \sigma \mid S = s) = \pi_{\ell,\sigma}$ (here we
omit indexing with respect to $s$ for notational convenience). The following identity is then easily
established:

\[
\sum_{\ell=1}^{N} E\!\left[ \mathbb{1}_{\{\ell \in I_1 \cap I_2\}} \left( 2\, \mathbb{1}_{\{Z_1(\ell) = Z_2(\ell)\}} - 1 \right) \right]
= \sum_{\ell=1}^{N} \left( \pi_{\ell,0} - \pi_{\ell,1} \right)^2. \tag{12}
\]
The left-hand side of (12) is thus a non-negative definite quadratic form of the variables

\[
p_{i,z} := P(I = i, Z = z \mid S = s),
\]

where we have that $\pi_{\ell,\sigma} = \sum_{(i,z):\, \ell \in i,\, z(\ell) = \sigma} p_{i,z}$ in (12). We know however by Lemma 22 that these
variables are constrained to lie in the convex set defined by the following inequalities:

\[
\sum_{(i,z) \in V} p_{i,z} = 1, \qquad e^{-\epsilon} \le p_{i,z} D \le e^{\epsilon}.
\]

Defining $\epsilon' := e^{\epsilon} - 1 = \max(e^{\epsilon} - 1, 1 - e^{-\epsilon})$, we can relax the last constraint to

\[
1 - \epsilon' \le p_{i,z} D \le 1 + \epsilon'.
\]

Provided $\epsilon$ is small enough (precisely, provided $\epsilon < \ln(2)$, which we have assumed), it holds that
$\epsilon' < 1$.

Given this setup, we can now formulate the problem of upper bounding $A'_1$ as the following
optimization problem:

\[
\begin{aligned}
\text{maximize} \quad & \sum_{\ell=1}^{N} \left( \pi_{\ell,1} - \pi_{\ell,0} \right)^2 \\
\text{subject to} \quad & \sum_{(i,z) \in V} p_{i,z} = 1, \\
& p_{i,z} D \in [1 - \epsilon', 1 + \epsilon'].
\end{aligned} \tag{13}
\]

In order to evaluate this bound, we need to first characterize the extremal points of the above convex
set. We do this in the following lemma.

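Lemma 26 The extremal points of the convex set of distributions $\{p_{i,z}\}$ defined by (13) consist
precisely of the distributions $p^A_{i,z}$ indexed by the sets $A \subset V$ of cardinality

\[
|A| = \binom{N}{w} 2^{w-1} = \frac{D}{2},
\]

defined by

\[
p^A_{i,z} = \begin{cases} \dfrac{1 + \epsilon'}{D} & \text{if } (i,z) \in A, \\[2mm] \dfrac{1 - \epsilon'}{D} & \text{if } (i,z) \notin A. \end{cases} \tag{14}
\]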
Proof Let $\{p_{i,z}\}$ be a probability distribution satisfying constraints (13). The aim is to establish the
existence of non-negative weights $\gamma_S$ for each subset $S \subset V$ of size $D/2$, summing to 1, and such
that for all $(i, z) \in V$, one has

\[
p_{i,z} = \sum_{S \subset V,\ |S| = D/2} \gamma_S \left( 1 + \epsilon'\, \mathbb{1}_{\{(i,z) \in S\}} - \epsilon'\, \mathbb{1}_{\{(i,z) \notin S\}} \right) / D. \tag{15}
\]

Let us now express the existence of such weights $\gamma_S$ as a property of a network flow problem. For
each $n \in [D]$, define

\[
\alpha_n := \left( p_n - \frac{1 - \epsilon'}{D} \right) \frac{D}{2\epsilon'}.
\]

The constraint $p_n \in [(1 - \epsilon')/D, (1 + \epsilon')/D]$ entails that $\alpha_n \in [0, 1]$. Construct now a network
with, for each $n \in [D]$, two links, labelled $(n \in)$ and $(n \notin)$, and with respective capacities $\alpha_n$ and
$1 - \alpha_n$. In addition, for each set $S \subset [D]$, $|S| = D/2$, create a route $r_S$ through this network, which
for each $n \in [D]$ crosses link $(n \in)$ if $n \in S$, and crosses link $(n \notin)$ if $n \notin S$. All such routes are
connected to a source and a sink node.

We now claim that the existence of probability weights $\gamma_S$ satisfying (15) is equivalent to the
fact that the maximum flow through this network is equal to 1. Indeed, the existence of a flow of
total weight 1 is equivalent to the existence of a probability distribution $\gamma_S$ on the routes $r_S$ through
this network which matches the link capacity constraints, that is to say such that for all $n \in [D]$, one
has

\[
\sum_{S:\, n \in S} \gamma_S = \alpha_n, \qquad \sum_{S:\, n \notin S} \gamma_S = 1 - \alpha_n.
\]

It is readily seen that this condition implies (15). Conversely, if the probability weights $\gamma_S$ satisfy
(15), using the definition of $\alpha_n$, it is easily seen that the two previous equations hold.

Let us now establish the existence of such a flow. To this end, we use the max flow-min cut
theorem. Any set of links that contains, for some $n \in [D]$, both links $(n \in)$ and $(n \notin)$, is a cut, and
its capacity is at least $\alpha_n + 1 - \alpha_n$, hence at least 1. Any cut $C$ which for each $n$ either does not
contain $(n \in)$ or does not contain $(n \notin)$ must be such that either

\[
\left| C \cap \{ \cup_{n \in [D]} (n \in) \} \right| > D/2 \tag{16}
\]

or

\[
\left| C \cap \{ \cup_{n \in [D]} (n \notin) \} \right| > D/2, \tag{17}
\]

for otherwise we can identify $S \subset [D]$, $|S| = D/2$, which crosses this cut $C$. Assume thus that (16)
holds. Assume without loss of generality that $C$ contains the links $(n \in)$ for all $n = 1, \ldots, D/2 + 1$.
The weight of this cut is thus at least $\sum_{n=1}^{D/2+1} \alpha_n$. We now argue that this must be at least 1. Indeed,
it holds that

\[
\sum_{n=1}^{D} \alpha_n = D/2.
\]

However, if $\sum_{n=1}^{D/2+1} \alpha_n < 1$, using the fact that each $\alpha_n$ is at most 1, it follows that $\sum_{n=1}^{D} \alpha_n$ is
strictly less than $1 + D/2 - 1 = D/2$, a contradiction. The case when cut $C$ verifies Equation (17)
is similar. ■
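The link capacities used in the flow argument are easy to compute for a concrete distribution. The sketch below is illustrative only (the function name `link_capacities` and the toy values of $D$ and $\epsilon'$ are assumptions). It maps a distribution $p$ on $D$ points, satisfying $p_n D \in [1 - \epsilon', 1 + \epsilon']$, to $\alpha_n = (p_n - (1-\epsilon')/D) \cdot D/(2\epsilon')$.

```python
def link_capacities(p, eps_prime):
    """alpha_n = (p_n - (1 - eps')/D) * D / (2 eps') for a distribution p on D points."""
    D = len(p)
    return [(pn - (1 - eps_prime) / D) * D / (2 * eps_prime) for pn in p]


D, eps_prime = 8, 0.5

# An extremal distribution of the form (14): half the points at (1+eps')/D, half at (1-eps')/D.
extremal = [(1 + eps_prime) / D] * (D // 2) + [(1 - eps_prime) / D] * (D // 2)
# The uniform distribution lies in the interior of the constraint set.
uniform = [1.0 / D] * D

print(link_capacities(extremal, eps_prime))  # capacities in {0, 1}: a single route carries all flow
print(link_capacities(uniform, eps_prime))   # all capacities 1/2
```

In both cases the capacities lie in $[0, 1]$ and sum to $D/2$, which is the identity the min-cut argument relies on.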
We can now complete the proof of Theorem 10. Since, as argued, the right-hand side of (12)
is a non-negative definite quadratic form of the $p_{i,z}$, it is in particular a convex
function of the $p_{i,z}$, and as such is maximized over the convex set described by (13) at one of its
extremal points, which are precisely identified by Lemma 26. It will thus suffice to establish the
following inequality for all $A \subset V$ of size half the cardinality of the full set:

\[
\sum_{\ell=1}^{N} \left( \pi^A_{\ell,0} - \pi^A_{\ell,1} \right)^2 \le O(w/N), \tag{18}
\]

where we introduced the notation, for all $\ell \in [N]$ and $\sigma \in \{0, 1\}$:

\[
\pi^A_{\ell,\sigma} = \sum_{i:\, \ell \in i}\ \sum_{z:\, z(\ell) = \sigma} p^A_{i,z},
\]

and $p^A_{i,z}$ is as defined in (14). Introducing also the sets

\[
A_{\ell,\sigma} = \{ (i, z) : \ell \in i \text{ and } z(\ell) = \sigma \},
\]

we have

\[
\pi^A_{\ell,0} - \pi^A_{\ell,1} = \frac{2\epsilon'}{D} \left[ |A_{\ell,0} \cap A| - |A_{\ell,1} \cap A| \right]
= \frac{2\epsilon'}{D} \langle v_\ell, \mathbb{1}_A \rangle, \tag{19}
\]



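where in the last display we used the following notations: $\langle \cdot, \cdot \rangle$ stands for the scalar product in
$\mathbb{R}^D$, $\mathbb{1}_A$ is the characteristic vector of the set $A$, and $v_\ell$ is defined as

\[
v_\ell(i, z) = \mathbb{1}_{\{\ell \in i\}} \left( 1 - 2 z(\ell) \right).
\]

Equation (19) entails that the left-hand side of Equation (18) also equals

\[
\left( \frac{2\epsilon'}{D} \right)^2 \sum_{\ell=1}^{N} \langle v_\ell, \mathbb{1}_A \rangle^2. \tag{20}
\]

The scalar product $\langle v_\ell, v_{\ell'} \rangle$ reads, for $\ell \ne \ell'$:

\[
\begin{aligned}
\langle v_\ell, v_{\ell'} \rangle &= \sum_{i:\, \ell, \ell' \in i}\ \sum_{z} \left( 1 - 2 z(\ell) \right) \left( 1 - 2 z(\ell') \right) \\
&= \sum_{i:\, \ell, \ell' \in i} 2^{w-2} \cdot 2 \left[ (1)(1) + (1)(-1) \right] \\
&= 0.
\end{aligned}
\]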



Note further that for all $\ell \in [N]$, one has

\[
\| v_\ell \|^2 = \binom{N-1}{w-1} 2^{w} = \frac{wD}{N}.
\]

Orthogonality and equality of norms among the $v_\ell$ readily implies that the expression in (20) is
upper-bounded by

\[
\left( \frac{2\epsilon'}{D} \right)^2 \frac{wD}{N}\, \| \mathbb{1}_A \|^2.
\]

Recalling that the vector $\mathbb{1}_A$ has $D/2$ entries equal to 1, and all other entries equal to zero, the square
of its Euclidean norm $\| \mathbb{1}_A \|^2$ equals precisely $D/2$. Plugging this value in the last display, after
cancellation, one obtains that the expression in (20) is bounded by

\[
\frac{2 \epsilon'^2 w}{N}.
\]

This completes the proof. ■
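The orthogonality argument can be verified by brute force on a small instance. This is a sketch under stated assumptions: the helper names `build_vectors` and `dot`, the toy sizes $N = 5$, $w = 2$, and the random choice of $A$ are all illustrative. It enumerates $V$, builds the vectors $v_\ell(i, z) = \mathbb{1}_{\{\ell \in i\}}(1 - 2z(\ell))$, and compares the quantity in (20) with $2\epsilon'^2 w / N$.

```python
from itertools import combinations, product
import random


def build_vectors(N, w):
    """Enumerate V = {(items, ratings)} and the vectors v_l(i, z) = 1{l in i} (1 - 2 z(l))."""
    support = [(items, z) for items in combinations(range(N), w)
               for z in product((0, 1), repeat=w)]
    vectors = []
    for l in range(N):
        vectors.append([(1 - 2 * z[items.index(l)]) if l in items else 0
                        for items, z in support])
    return support, vectors


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


N, w, eps_prime = 5, 2, 0.5
support, vectors = build_vectors(N, w)
D = len(support)  # D = C(N, w) * 2^w

# A hypothetical extremal set A: a uniformly random half of V.
rng = random.Random(0)
A = set(rng.sample(range(D), D // 2))
indicator = [1 if n in A else 0 for n in range(D)]

lhs = (2 * eps_prime / D) ** 2 * sum(dot(v, indicator) ** 2 for v in vectors)
rhs = 2 * eps_prime ** 2 * w / N
print(lhs, rhs)
```

The vectors are pairwise orthogonal with common squared norm $wD/N$, so the inequality is an instance of Bessel's inequality applied to $\mathbb{1}_A$.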



Appendix D. The MaxSense algorithm 

In this appendix, we state and prove the privacy and performance guarantees of the MaxSense 
algorithm, and its variants. The algorithm is formally specified in Algorithm 2. 

Algorithm 2 The MaxSense Algorithm

Setting: $N$ items $[N]$, $U$ users $[U]$. Each user has a set of $w$ ratings $(W_u, R_u)$, $W_u \in \binom{[N]}{w}$, $R_u \in \{0, 1\}^w$.
Each item $i$ is associated with a cluster $C_N(i)$ from a set of $L$ clusters, $\{1, 2, \ldots, L\}$.
Output: The cluster labels of each item, i.e., $\{C_N(i)\}_{i \in [N]}$.
Stage 1 (User sketch generation):

• For each user $u \in [U]$, generate a sensing vector $H_u \in \{0, 1\}^N$, where $H_{ui}$ is a 'probe' for
item $i$ given by:

$H_{ui} \sim \mathrm{Bernoulli}(p)$, i.i.d.,
with $p = \theta / w$, where $\theta$ is a chosen constant.

• User $u$ generates a private sketch $S^0_u$ given by:

$S^0_u(W_u, R_u, H_u) = \max_{i \in [N]} H_{ui} \hat{R}_{ui}$,

where $\hat{R}_{ui} = R_{ui}$ if $i \in W_u$, and $0$ otherwise.

Stage 2 (User sketch privatization): Each user $u \in [U]$ releases a privatized sketch $S_u$ from $S^0_u$
using the $\epsilon$-DP bit release mechanism (Proposition 2).
Stage 3 (Item Clustering):

• For each item $i \in [N]$, compute a count $B_i$ as:

$B_i = \sum_{u \in [U]} H_{ui} S_u$

• Perform k-means clustering using the counts $\{B_i\}_{i \in [N]}$ with $k = L$.
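Stages 1 to 3 of Algorithm 2, up to the final clustering step, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the $\epsilon$-DP bit release of Proposition 2 is assumed here to be standard randomized response (keep the bit with probability $e^{\epsilon}/(1 + e^{\epsilon})$), the probe probability is taken as $p = \theta / w$, and the k-means step on the counts is left out.

```python
import math
import random


def maxsense_sketch(rated_items, ratings, probes):
    """S0 = max_i H_i * Rhat_i: equals 1 iff some probed item was rated 1 by the user."""
    return int(any(probes[i] and r == 1 for i, r in zip(rated_items, ratings)))


def release_bit(bit, eps, rng):
    """Assumed eps-DP release: keep the bit w.p. e^eps / (1 + e^eps), flip it otherwise."""
    keep = rng.random() < math.exp(eps) / (1.0 + math.exp(eps))
    return bit if keep else 1 - bit


def item_counts(N, users, eps, theta, rng):
    """Compute B_i = sum_u H_ui * S_u from the privatized sketches of all users."""
    counts = [0] * N
    for rated_items, ratings in users:
        p = theta / len(rated_items)  # probe probability p = theta / w (requires theta <= w)
        probes = [1 if rng.random() < p else 0 for _ in range(N)]
        s = release_bit(maxsense_sketch(rated_items, ratings, probes), eps, rng)
        for i in range(N):
            counts[i] += probes[i] * s
    return counts


# Toy data: each user is (rated item indices W_u, ratings R_u) with w = 2.
users = [([0, 1], [1, 0]), ([2, 3], [1, 1]), ([1, 4], [0, 0])]
print(item_counts(5, users, eps=1.0, theta=1.0, rng=random.Random(2)))
```

The counts returned by `item_counts` are what Stage 3 would pass to a k-means routine with $k = L$.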
Theorem (Theorem 11 in the paper) The MaxSense algorithm (Algorithm 2) is $\epsilon$-differentially
private. Further, suppose we define:

\[
\tilde{\epsilon} = \frac{2 (e^{\epsilon} - 1)}{e^{\epsilon} + 1},
\]


mm 

1<1<1'<L 



E 

k=l 



a k e 



(b 



kl 1 



then for any $d > 0$, there exists a constant $C > 0$ such that if the number of users satisfies:

' N 2 log AT 



U > C 



e 2 ^ 2 ■ w 

mm 



N~ 



(21) 



then the clustering is successful with probability $1 - N^{-d}$.
Proof 

Privacy: For each user $u$, observe that $H_u$ is independent of the data $(W_u, R_u)$, and hence preserves
privacy. Next, given $H_u$, we have that $(W_u, R_u) \to S^0_u \to S_u$ form a Markov chain, and hence it
is sufficient via the post-processing property to prove that $S^0_u \to S_u$ satisfies $\epsilon$-differential privacy.
This is a direct consequence of using the $\epsilon$-DP bit release mechanism (Proposition 2). Now, using
the post-processing property of differential privacy (Proposition 15), we get our result.
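The privacy claim for the bit release can be illustrated numerically. The sketch below assumes, as a stand-in for Proposition 2 (which is not reproduced in this appendix), that the release is randomized response with keep probability $e^{\epsilon}/(1 + e^{\epsilon})$; the function names are hypothetical. It checks that the likelihood ratio of the released bit is exactly $e^{\epsilon}$, and that the keep probability equals $\frac{1}{2} + \frac{\tilde{\epsilon}}{4}$ with $\tilde{\epsilon} = 2(e^{\epsilon} - 1)/(e^{\epsilon} + 1)$.

```python
import math


def release_probabilities(eps):
    """P[S = 1 | S0 = b] for b in {0, 1} under the assumed randomized-response release."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return {1: keep, 0: 1.0 - keep}


def eps_tilde(eps):
    """The quantity 2 (e^eps - 1) / (e^eps + 1), close to eps when eps is small."""
    return 2.0 * (math.exp(eps) - 1.0) / (math.exp(eps) + 1.0)


eps = 0.7
probs = release_probabilities(eps)
print(probs[1] / probs[0], math.exp(eps))      # likelihood ratio versus e^eps
print(probs[1], 0.5 + eps_tilde(eps) / 4)      # keep probability versus 1/2 + eps_tilde/4
```

By symmetry the same ratio bounds the probabilities of releasing 0, so the worst-case likelihood ratio over both outputs is $e^{\epsilon}$.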

Performance: The intuition behind the correctness of the clustering of items is as follows:
First, we show that for any item $j$, its count $B_j$ will concentrate around $\bar{B}_{l(j)}$, the expected count
for its corresponding cluster. Next, we calculate the minimum separation between the expected
counts for any two item-clusters (denoted as $\Delta_{\min}$). Finally, we show that for the given scaling of
users, with high probability we have that each item count $B_j$ is within a distance of $\Delta_{\min}/5$ from
its corresponding $\bar{B}_{l(j)}$. This will then ensure that any two items belonging to the same cluster
are within a distance of $2\Delta_{\min}/5$, while two items of different clusters have a separation of at least
$3\Delta_{\min}/5$, thereby ensuring successful clustering.

First, for any item $i \in [N]$, consider the random variable $B_i = \sum_{u \in [U]} H_{ui} S_u$. We have that:

\[
E[B_i] = \sum_{u \in [U]} E[H_{ui} S_u]
\]
using the i.i.d. sensing property and the definition of the privacy mechanism. Now, we substitute
(Note: for small $\epsilon$, we have $\tilde{\epsilon} \approx \epsilon$) to get:
We use the shorthand notation $k(u)$ to denote the user-cluster of user $u$ and similarly $l(j)$ to denote
the item-cluster of item $j$ to get:



u 



E[B i ] = J2p 
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1 e e 

2 + 4 _ 2 



1 - 
(Using the i.i.d. sensing properties of $H_{ui}$)
(Grouping terms by user and item classes.)

Note that we have dropped the explicit dependence on the user index and retained only the
user-cluster label. Similarly, we henceforth write $k$ and $l$ for $k(u), l(j)$ respectively, whenever it does
not cause confusion in the notation. Also from the algorithm specification, we have $pw = \theta$.
Furthermore, we define:

g gAp [s o = o| J b( U ) =fc]= n (i-^p), 

je[N] v 7 

i.e., $q^0_k$ is the probability that a user $u$ of cluster $k$ will have a (private) sketch $S^0_u$ equal to 0. Thus
we have:
Before continuing, we need to analyze the term $q^0_k$. We have:
From this we can see that:



1 



Thus we see that for any user-cluster, the probability of the MaxSense sketch being 0 is $\Theta(1)$.
Intuitively, this means that each sketch has close to 1 bit of information. We define $q^0 = \sum_{k} q^0_k$
(which is the probability that a random user's sketch is 0). Now, noting that the expectation of $B_i$
only depends on the class of item $i$, we define $\bar{B}_l = E[B_i \mid l(i) = l]$. Then we have:
Suppose $w = o(N)$. Since $\epsilon < 1$, then for sufficiently large $N$, we have that for all item classes
$l \in \{1, 2, \ldots, L\}$:

\[
\bar{B}_l \le Up.
\]

Next, given any two distinct item classes $l, m$, we define $\Delta_{lm} \triangleq E[|B_l - B_m|]$. Then we have:
$\Delta_{lm} \ge |E[B_l - B_m]|$ (By Jensen's Inequality)
where we define:
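Let $\Delta_{\min} = \min_{(l,m) \in [L]^2,\ l \ne m} \Delta_{lm}$. Now, for a given item $j$, a standard Chernoff bound (applicable
since the sketches are independent and bounded) gives us that for any $a > 0$:

\[
P\!\left[ |B_j - \bar{B}_{l(j)}| \ge a \bar{B}_{l(j)} \right] \le 2 \exp\!\left( - \frac{a^2 \bar{B}_{l(j)}}{3} \right).
\]

Following the above discussion, we choose $a = \frac{\Delta_{\min}}{5 \bar{B}_{l(j)}}$. Then we have:

\[
P\!\left[ |B_j - \bar{B}_{l(j)}| \ge \frac{\Delta_{\min}}{5} \right]
\le 2 \exp\!\left( - \frac{\Delta_{\min}^2}{75\, \bar{B}_{l(j)}} \right)
\le 2 \exp\!\left( - \frac{\Delta_{\min}^2}{75\, Up} \right),
\]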



and by taking union bound over all items, we have: 

A, 



sup \Bj-Bi(J)\ > 



< exp I log 2N 

< exp ( log 2N 



where we have used p = f- Now if we choose U as: 
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then we have: 



sup \Bj-Bi{j)\ > 
Thus, if the number of users scales according to (21), then the clustering is successful with
probability $1 - N^{-d}$. ■
Finally, we define Multi-MaxSense, a generalization of Algorithm 2, wherein we ask multiple
MaxSense questions to each user. Each question now has a privacy parameter of $\epsilon/Q$, where $Q$ is the
number of questions asked to a user; thus we obtain $\epsilon$-DP via the composition property (Proposition
14). Independence between the answers is ensured as follows: first, for each user, we choose a
random partition of $[N]$ into $1/p$ sets, each of size $Np$; we pick $Q$ of these and present them to the
user. Next, each user calculates $Q$ sketches using these $Q$ sensing vectors, and reveals the privatized
set of sketches (with each sketch revelation obeying $(\epsilon/Q)$-differential privacy). Finally, we calculate the
item counts as before. More formally, the algorithm is given in Algorithm 3.

Algorithm 3 The Multi-MaxSense Algorithm

Setting: $N$ items $[N]$. $U$ users $[U]$, each with data $(W_u, R_u) \in \binom{[N]}{w} \times \{0, 1\}^w$. Parameter $Q$.
Output: The cluster labels of each item, $\{C_N(i)\}_{i \in [N]}$.
Stage 1 (User sketch generation):

• For each user $u \in [U]$, generate $Q$ sensing vectors $H_{(u,q)} \in \{0, 1\}^N$, where each vector is
generated by choosing $Np$ items uniformly and without replacement. As before, $p = \theta / w$.

• User $u$ generates $Q$ private sketches $S^0_{(u,q)}$ as in Algorithm 2.

Stage 2 (User sketch privatization): Each user $u \in [U]$ releases $Q$ privatized sketches, where each
sketch is generated using an $(\epsilon/Q)$-private bit release mechanism (Proposition 2).
Stage 3 (Item Clustering):

• For each item $i \in [N]$, compute a count $B_i = \sum_{u \in [U]} \sum_{q \in [Q]} H_{(u,q)i} S_{(u,q)}$

• Perform k-means clustering using the counts $\{B_i\}_{i \in [N]}$ with $k = L$.
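The non-overlapping sensing vectors of Stage 1 can be generated from a random partition of the item set. The following is a sketch under stated assumptions: the function name `disjoint_sensing_vectors` is hypothetical, the block size is taken as $Np$ rounded to an integer, and any leftover items that do not fill a whole block are simply not probed.

```python
import random


def disjoint_sensing_vectors(N, p, Q, rng):
    """Pick Q blocks from a random partition of [N] into blocks of size ~ N*p."""
    block = max(1, round(N * p))
    order = list(range(N))
    rng.shuffle(order)
    # Cut the shuffled item list into consecutive full blocks (about 1/p of them).
    blocks = [order[k:k + block] for k in range(0, N - block + 1, block)]
    if Q > len(blocks):
        raise ValueError("Q cannot exceed the number of blocks (about 1/p)")
    vectors = []
    for items in rng.sample(blocks, Q):
        h = [0] * N
        for i in items:
            h[i] = 1
        vectors.append(h)
    return vectors


vecs = disjoint_sensing_vectors(20, 0.25, 3, random.Random(0))
for h in vecs:
    print(h)
```

Because the chosen blocks are disjoint, no item is probed by more than one of a user's $Q$ questions, which is the property used to argue independence of that user's sketches.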



Now we have the following theorem. 

Theorem 27 The Multi-MaxSense algorithm (Algorithm 3) is $\epsilon$-differentially private. Further,
suppose $Q = \lceil \epsilon \rceil$. Then for any $d > 0$, there exists a constant $c$ such that if the number of users
satisfies:

N 2 \ogN\ 

then the clustering is successful with probability $1 - N^{-d}$.

Proof Privacy: Since each user reveals $Q$ bits, and each bit is privatized using an $(\epsilon/Q)$-differentially
private mechanism, for any user $u$, the user sketches $\{S_{u,q}\}_{q=1}^{Q}$ and user data $(W_u, R_u)$ are
private using the composition property (Proposition 14). The remaining proof for the privacy of the
learning algorithm is as before, using the post-processing property.
Performance: To show the improved scaling, we need to observe the following:

1. Due to the way in which the sensing vectors are chosen, the probability of any probe for an
item in any sensing vector (i.e., $H_{(u,q)i}$ for some $u \in [U]$, $q \in [Q]$, $i \in [N]$) being set to 1 is
$p$, i.i.d.



U> c 
2. Further, since the multiple sensing vectors given to a single user do not overlap, the
sketches $\{S_{u,q}\}_{u,q}$ are also independent.

Hence, the analysis of Algorithm 2 can be repeated with $U$ being replaced with $QU$ and $\epsilon$ being
replaced with $\epsilon/Q$. Choosing $Q = \lceil \epsilon \rceil$ implies that we now have:



~ H exp WtJ- 1 

e = — - — 



> 



exp^) + l 
2(e-l) 



e + 2 

Substituting these in equation 21, we get the condition for correct clustering with high probability 

as: 
Appendix E. Lower Bound for Adaptive Queries

The lower bounds of Section 4 applied to non-adaptive learning, where queries to user $u$ are
designed without leveraging answers of users $1, \ldots, u - 1$. One can extend these bounds to the
adaptive setting, where the query to user $u$ is allowed to depend on the previous queries and answers of users
$1, \ldots, u - 1$. Specifically we now assume that questions are asked to users sequentially, and the
question to which the $t$-th user answers can be affected by the previous sketch releases $S_1, \ldots, S_{t-1}$
of the $t - 1$ previous users. We shall now prove the following:

Theorem 28 Assume $w = 1$. If users' answers are $\epsilon$-DP, the number of adaptive queries needed
to learn an unknown content clustering into two types drawn uniformly at random from $\{0, 1\}^N$ is
$\Omega(N \log N)$.

Proof 

In the sequel we assume that $T - 1$ sketches have been released, and denote by $P^T$ the probability
distribution conditionally on the previously observed sketch values. We shall develop bounds of the
form:

\[
\mathcal{I}(Z; S_1^T) \le \delta_T
\]

for a suitable function $\delta_T$.

These bounds are obtained inductively as follows. First, we expand the mutual information as:

\[
\mathcal{I}(Z; S_1^T) = \sum_{t=1}^{T} \mathcal{I}(Z; S_t \mid S_1^{t-1}).
\]
Now recall from before that we define:
i.e., the mutual information between $U$ and $V = v$ conditioned on $W = w$. Hence, we have
$\sum_{v} P(V = v \mid W = w)\, \mathcal{I}(U; V = v \mid W = w) = \mathcal{I}(U; V \mid W = w)$ and
$\sum_{w} P(W = w)\, \mathcal{I}(U; V \mid W = w) = \mathcal{I}(U; V \mid W)$. Further, using this definition, we can bound the mutual information as:

\[
\mathcal{I}(Z; S_1^T) \le \mathcal{I}(Z; S_1^{T-1}) + \sup_{s,\, s_1^{T-1}} \mathcal{I}(Z; S_T = s \mid S_1^{T-1} = s_1^{T-1}).
\]



Now consider any sequence $\{s_1^{T-1}, s\}$. Defining $P^T$ to be the probability measure conditional
on $S_1^{T-1} = s_1^{T-1}$, we use Lemma 23 to bound the term $\mathcal{I}(Z; S_T = s \mid S_1^{T-1} = s_1^{T-1})$, to get:



„t-i 



T-l 



X(Z-S T = s\S[ 



T-l 



12,22 



\Z2 rp 



where $p^{T+1}(i, z) = P^{T+1}[(I, Z) = (i, z)]$ and $p_i^{T+1}(z) = P^{T+1}[Z(i) = z]$ (in other words, all
quantities are defined w.r.t. the probability measure conditional on $S_1^{T-1} = s_1^{T-1}$ and $S_T = s$).

Using Lemma 22, we have $p^{T+1}(i, z_i) = f_i(z_i) \frac{1}{N} p_i^{T}(z_i)$, where the likelihood ratio $f_i(z_i)$
belongs to $[1 - \epsilon', 1 + \epsilon']$ where $\epsilon' = e^{\epsilon} - 1$. Using this expression, the RHS of the previous inequality
can be rewritten as:



1(Z;S T = s\S( 



T-l 



„T-1n 



' N 

E 

.1=1 



fi(Zi) 



where $\mathrm{Var}^T$ is defined w.r.t. the $P^T$ measure. Let $P^0$ be the unconditional probability, under which
the $Z_i$ are i.i.d. uniform on $\{0, 1\}$. We define $F := \sum_{i=1}^{N} f_i(Z_i)$; note that under $P^0$, the random
variable $F$ has a variance that is at most $2 \epsilon'^2 N$. If we could have a similar bound for the variance
of $F$ under $P^T$ rather than under $P^0$, this would yield an upper bound of order $1/N$ on the mutual
information of interest.

We now proceed to show that, provided the two distributions $P^0$ and $P^T$ have small
Kullback-Leibler divergence, the variance of $F$ under $P^T$ is indeed of order at most $N$. The argument
proceeds in several steps.

Step 1: Relating divergence between $P^T$ and $P^0$ to the divergence between the law of $F$
under $P^T$ and under $P^0$:

Lemma 29 For each $f$ in the support of the discrete random variable $F$, let $p_f$ and $p^0_f$ denote the
probabilities that $F = f$ under $P^T$ and $P^0$ respectively. Then we have:

\[
H(P^0) - H(P^T) \ge D(p \,\|\, p^0). \tag{22}
\]
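Proof For each $f$, let $N_f$ denote the number of vectors $z \in \{0, 1\}^N$ for which $F = f$, so that
$p^0_f = N_f 2^{-N}$. Write

\[
\begin{aligned}
H(P^T) &= \sum_{f} \sum_{z:\, F(z) = f} P^T(z) \log\!\left( \frac{1}{P^T(z)} \right) \\
&= \sum_{f} p_f \left[ \sum_{z:\, F(z) = f} \frac{P^T(z)}{p_f} \log\!\left( \frac{p_f}{P^T(z)} \right) + \log\!\left( \frac{1}{p_f} \right) \right] \\
&\le \sum_{f} p_f \left[ \log\!\left( \frac{1}{p_f} \right) + \log(N_f) \right] \\
&= \sum_{f} p_f \left[ \log\!\left( \frac{1}{p_f} \right) + N \log(2) + \log(p^0_f) \right] \\
&= H(P^0) - D(p \,\|\, p^0),
\end{aligned}
\]

where the inequality follows by upper-bounding the entropy of a probability distribution on a set of
size $N_f$ by $\log(N_f)$. ■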
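Inequality (22) can be checked by enumeration for small $N$. The sketch below is illustrative: the function name `lemma29_sides`, the random likelihood ratios `f_values`, and the random distribution standing in for $P^T$ are all assumptions made here. It computes the entropy gap $H(P^0) - H(P^T) = N \log 2 - H(P^T)$ and the divergence $D(p \,\|\, p^0)$ between the laws of $F$.

```python
import math
import random
from collections import defaultdict
from itertools import product


def lemma29_sides(N, f_values, P_T):
    """Return (H(P0) - H(P_T), D(p || p0)), where p, p0 are the laws of F under P_T and P0."""
    p, p0 = defaultdict(float), defaultdict(float)
    for z in product((0, 1), repeat=N):
        # F(z) = sum_i f_i(z_i); rounding groups numerically equal values of F together.
        f = round(sum(f_values[i][z[i]] for i in range(N)), 9)
        p[f] += P_T[z]
        p0[f] += 2.0 ** (-N)
    entropy_T = -sum(q * math.log(q) for q in P_T.values() if q > 0)
    gap = N * math.log(2) - entropy_T
    kl = sum(q * math.log(q / p0[f]) for f, q in p.items() if q > 0)
    return gap, kl


N = 4
rng = random.Random(0)
# Hypothetical likelihood ratios f_i(0), f_i(1) in [1 - eps', 1 + eps'] with eps' = 0.3.
f_values = [(1 - 0.3 * rng.random(), 1 + 0.3 * rng.random()) for _ in range(N)]
weights = {z: rng.random() for z in product((0, 1), repeat=N)}
total = sum(weights.values())
P_T = {z: wgt / total for z, wgt in weights.items()}

gap, kl = lemma29_sides(N, f_values, P_T)
print(gap, kl)
```

The gap equals $D(P^T \,\|\, P^0)$, so the inequality is the data-processing property of the divergence applied to the map $z \mapsto F(z)$.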



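Step 2: Bounding variance of $F$ under $P^T$ given divergence constraints:

Let $\bar{F}$ denote the expectation of $F$ under $P^0$, i.e., $\bar{F} = \sum_f p^0_f f$. Also let $\sigma^2$ denote the variance
of $F$ under $P^0$. Note that

\[
\mathrm{Var}^T(F) = \inf_{x} E^T (F - x)^2 \le E^T (F - \bar{F})^2 = \sum_{f} p_f (f - \bar{F})^2.
\]

Assume that the entropy $H(P^T)$ verifies $H(P^T) \ge H(P^0) - \delta$, for some $\delta > 0$. Then in view of
(22) and the previous display, an upper bound on the variance of $F$ under $P^T$ is provided by the
solution of the following optimization problem:

\[
\begin{aligned}
\text{Maximize} \quad & \sum_f p_f (f - \bar{F})^2 \\
\text{over} \quad & p_f \ge 0 \\
\text{such that} \quad & \sum_f p_f = 1 \\
\text{and} \quad & \sum_f p_f \log\!\left( \frac{p_f}{p^0_f} \right) \le \delta.
\end{aligned} \tag{23}
\]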


It is readily seen, by introducing the Lagrangian of this optimization problem and a dual variable
$\nu^{-1} > 0$ for the constraint (23), that the optimum of this convex optimization problem is achieved by

\[
p_f = \frac{1}{Z(\nu)}\, p^0_f\, e^{\nu (f - \bar{F})^2}
\]

for a suitable positive constant $\nu$, where the normalization constant $Z(\nu)$ is given by:

\[
Z(\nu) := \sum_f p^0_f\, e^{\nu (f - \bar{F})^2} = E^0 e^{\nu (F - \bar{F})^2}.
\]

For this particular distribution, the divergence $D(p \,\|\, p^0)$ reads:

\[
\sum_f p_f \left[ \nu (f - \bar{F})^2 - \log Z(\nu) \right] = - \log(Z(\nu)) + \frac{\nu}{Z(\nu)} E^0 (F - \bar{F})^2 e^{\nu (F - \bar{F})^2},
\]

so that constraint (23) reads

\[
- \log(Z(\nu)) + \frac{\nu}{Z(\nu)} E^0 (F - \bar{F})^2 e^{\nu (F - \bar{F})^2} \le \delta. \tag{24}
\]

This characterization in turn allows us to establish the following:

Lemma 30 Let $\psi(\nu) := \log Z(\nu)$. Assume there exist $a, \nu > 0$ such that

\[
\nu a - \psi(\nu) \ge \delta. \tag{25}
\]

Then the value of the optimization problem (23) is less than or equal to $a$.

Proof Note that by Hölder's inequality, the function $\psi$ is convex, so that its derivative

\[
\psi'(\nu) = Z^{-1}(\nu)\, E^0 (F - \bar{F})^2 e^{\nu (F - \bar{F})^2}
\]

is non-decreasing. Note further that the function $\nu \psi'(\nu) - \psi(\nu)$ appearing in the left-hand side of
(24) is non-decreasing for non-negative $\nu$, as its derivative reads $\nu \psi''(\nu)$. Thus the value $\nu^*$ which
achieves the optimum is such that

\[
\nu^* \psi'(\nu^*) - \psi(\nu^*) = \delta
\]

and the sought bound is $\psi'(\nu^*)$. Now for a given $a \in \mathbb{R}$, the supremum of $\nu a - \psi(\nu)$ is achieved
precisely at $\nu$ such that $a = \psi'(\nu)$. Thus if for some $\nu$ and some $a$ Condition (25) holds, it follows
that

\[
\sup_{\nu} \left( \nu a - \psi(\nu) \right) \ge \delta = \sup_{\nu} \left( \nu a^* - \psi(\nu) \right),
\]

where $a^* := \psi'(\nu^*)$. It follows from monotonicity of $\nu \to \nu \psi'(\nu) - \psi(\nu)$ that the value $\nu'$
where the supremum is achieved in the left-hand side, and such that $a = \psi'(\nu')$, verifies $\nu' \ge \nu^*$.
Monotonicity of $\psi'$ then implies that $a \ge a^*$ as announced. ■

Step 3: Deriving explicit bounds, using concentration results under $P^0$.

Define the centered and scaled random variable

\[
G := \frac{F - \bar{F}}{\sigma}.
\]

Recall that after centering, each variable $f_i(Z_i)$ is bounded in absolute value by $\epsilon'$. Thus, using the
Azuma-Hoeffding inequality yields the following bound:

\[
P^0(G \ge A) \le e^{-A^2/2}, \quad A > 0, \tag{26}
\]

and the same bound holds for $P^0(G \le -A)$. To obtain the above, we used the fact that after
centering, $f_i(Z_i)$ is of the form $\sigma_i (2 Z_i - 1)$ where $\sigma_i$ is the standard deviation of $f_i(Z_i)$.

We now apply these to bound the value of the so-called partition function $Z(\nu)$ as follows:

Lemma 31 Let $\nu > 0$ be such that $a := \nu \sigma^2 < 1/2$. Denoting $s := 1/(1 - 2a)$, the partition
function $Z(\nu)$ verifies

\[
Z(\nu) \le 1 + 4as = 1 + \frac{4a}{1 - 2a}. \tag{27}
\]
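Proof Write

\[
\begin{aligned}
Z(\nu) &= \int_0^{\infty} P^0\!\left( e^{\nu (F - \bar{F})^2} \ge t \right) dt \\
&= 1 + \int_1^{\infty} P^0\!\left( \nu (F - \bar{F})^2 \ge \log t \right) dt \\
&= 1 + \int_1^{\infty} P^0\!\left( |G| \ge \sqrt{\frac{\log t}{a}} \right) dt \\
&= 1 + \int_0^{\infty} P^0(|G| \ge t)\, 2at\, e^{a t^2}\, dt \\
&= 1 + \int_0^{\infty} \left[ P^0(G \ge t) + P^0(G \le -t) \right] 2at\, e^{a t^2}\, dt.
\end{aligned}
\]

Using Hoeffding's bound (26), the last term is upper-bounded by

\[
1 + 2 \int_0^{\infty} e^{-t^2/2}\, 2at\, e^{a t^2}\, dt
= 1 + 2 \left[ - \frac{2a}{1 - 2a}\, e^{-(t^2/2)(1 - 2a)} \right]_0^{\infty}
= 1 + \frac{4a}{1 - 2a},
\]

as announced in (27). ■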
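The bound (27) can be compared with an exact computation in a simple special case. The sketch below assumes, for illustration only, that all $f_i(Z_i)$ have the same standard deviation, so that $G = (2K - N)/\sqrt{N}$ with $K \sim \mathrm{Binomial}(N, 1/2)$ under $P^0$; the function names are hypothetical.

```python
from math import comb, exp


def partition_function(N, a):
    """Z(nu) = E0[exp(a * G^2)] with G = (2K - N)/sqrt(N), K ~ Binomial(N, 1/2), a = nu * sigma^2."""
    return sum(comb(N, k) * 2.0 ** (-N) * exp(a * (2 * k - N) ** 2 / N)
               for k in range(N + 1))


def lemma31_bound(a):
    """Right-hand side of (27): 1 + 4a / (1 - 2a), valid for a < 1/2."""
    return 1.0 + 4.0 * a / (1.0 - 2.0 * a)


for N in (10, 50, 200):
    for a in (0.05, 0.2, 0.4):
        print(N, a, partition_function(N, a), lemma31_bound(a))
```

In this equal-variance case the exact value stays well below the bound, which is expected since (27) only uses the tail estimate (26).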

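Fix now $\delta > 0$ and let as before $\sigma^2$ denote the variance of $F$ under $P^0$. We set out to find an
$a > 0$ that is an upper bound of its variance under $P^T$ by using the previous two lemmas. In view of
Lemma 30, it suffices to verify that for some $\nu > 0$, Condition $\nu a - \psi(\nu) \ge \delta$ holds. In view of
Lemma 31, denoting the corresponding upper bound to $\psi(\nu)$ by

\[
\phi(\nu) := \begin{cases} \log\!\left( 1 + \dfrac{4 \nu \sigma^2}{1 - 2 \nu \sigma^2} \right) & \text{if } \nu \sigma^2 < 1/2, \\[2mm] +\infty & \text{otherwise,} \end{cases}
\]

it suffices to find $a$ such that for some $\nu$, $a\nu - \phi(\nu) \ge \delta$. Maximizing $\nu a - \phi(\nu)$ over $\nu$ for fixed $a$,
one finds that the optimal value for $\nu$ is given by

\[
\nu = \frac{1}{2 \sigma^2} \sqrt{\frac{b - 4}{b}},
\]

where we introduced the notation $b := a / \sigma^2$. Plugging this expression for $\nu$ in $\nu a - \phi(\nu)$, we have
that $a$ upper-bounds the variance of interest if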
that a upper-bounds the variance of interest if 



\h~l I 2(1 -4/b) 1 / 2 \ ^ . 

2& v— - iog ( i+ i- ( i-4/V 2 j- 

For $b \ge 16/3$, it holds that $1/2 \le (1 - 4/b)^{1/2} \le 1$. Thus under this condition on $b$, the left-hand
side of the above is at least as large as

2b * (1/2) - log f 1 Y-1 + 4/& J - b " l0g(6) - (1 " 1/e)b - 
We have thus established the following: 

Lemma 32 Under the Kullback-Leibler bound of $\delta$, the variance of $F$ under $P^T$ is
upper-bounded by

\[
a = \sigma^2 \max\!\left( \frac{16}{3}, \frac{\delta}{1 - 1/e} \right).
\]






We can now complete the proof of the Theorem. An upper bound $\delta_T$ on the conditional mutual
information obtained after $T$ steps, uniformly over the sketch values observed, is evaluated recursively
as

\[
\delta_T \le \delta_{T-1} + \frac{1}{N^2}\, \sigma^2 \max\!\left( \frac{16}{3}, \frac{\delta_{T-1}}{1 - 1/e} \right).
\]

Recalling that $\sigma^2 \le 2 N \epsilon'^2$, we rewrite this for convenience as

\[
\delta_T \le \delta_{T-1} + \frac{C}{N} \max(1, \delta_{T-1}),
\]

for some suitable constant $C$.

It then follows that $\delta_T \le CT/N$ for $T \le N/C$, and for $T > N/C$ one has

\[
\delta_T \le \left( 1 + \frac{C}{N} \right)^{T}.
\]

Thus for any fixed exponent $\alpha > 0$, in order to learn $N^{\alpha}$ bits of information about the unknown
labels, one needs at least $T = \alpha \log(N) / \log(1 + C/N) = \Omega(N \log N)$ samples.
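The growth implied by the recursion can be traced by direct iteration. This sketch treats the recursion $\delta_T \le \delta_{T-1} + \frac{C}{N} \max(1, \delta_{T-1})$ as an equality (the most optimistic case for the learner); the function name `steps_to_reach` and the values of $N$, $C$, $\alpha$ are illustrative assumptions.

```python
import math


def steps_to_reach(N, C, alpha):
    """Smallest T with delta_T >= N^alpha when delta_T = delta_{T-1} + (C/N) max(1, delta_{T-1})."""
    target = N ** alpha
    delta, T = 0.0, 0
    while delta < target:
        delta += (C / N) * max(1.0, delta)
        T += 1
    return T


N, C, alpha = 1000, 2.0, 0.5
T = steps_to_reach(N, C, alpha)
# Compare with the lower bound alpha * log(N) / log(1 + C/N), which scales as N log N.
print(T, alpha * math.log(N) / math.log(1 + C / N))
```

The iteration shows a linear phase of about $N/C$ steps followed by geometric growth at rate $1 + C/N$, so the number of steps needed is never below $\alpha \log(N) / \log(1 + C/N)$.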